Notebook Metadata¶
- Notebook Name: 01_eda.ipynb
- Title: MoneyLion DS Assessment
- Author: Khoon Ching Wong
- Created: 2024-09-24
- Last Modified: 2025-09-25
- Description:
This notebook performs exploratory data analysis (EDA) on loan-level datasets by merging loan attributes, underwriting records and ACH payment data using unique IDs. The goal is to prepare and validate features for downstream model training (see 02_model.ipynb), where Optuna is applied for optimization to reduce institutional financial losses.
The workflow includes: data imports, data manipulation, exploratory data analysis (EDA) and feature engineering.
- Inputs:
  - clarity_underwriting_variables.csv
  - loan.csv
  - payment.csv
- Outputs:
  - Masked correlation matrix: temp/Loan-level/correlation.csv
  - Correlation heatmap as HTML: temp/Loan-level/correlation_heatmap.html
  - Cleaned matched dataset: temp/clean_df.parquet
- Repository/Project Link: https://github.com/wongkhoon/DS-Assessment/tree/main/MoneyLion/notebooks
Import libraries¶
import IPython.core.interactiveshell
import gc
import sys
import os
import multiprocessing
import psutil
import platform
from IPython.display import display, Markdown
import pandas as pd
import numpy as np
from functools import reduce
from collections import Counter
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
import plotly.express as px
import plotly.graph_objects as go
from itertools import product
import seaborn as sns
from dython.nominal import associations
import calendar
import pathlib, subprocess, urllib
import session_info
Display settings configuration¶
- Configure display settings for enhanced output in Jupyter notebook
# Display full output in output cell, not only the last result
IPython.core.interactiveshell.InteractiveShell.ast_node_interactivity = "all"
# Maximum rows and columns of Pandas DataFrame for current setting
#print(pd.options.display.max_rows)
#print(pd.options.display.max_columns)
# Print all the contents of a Pandas DataFrame
#pd.set_option("display.max_rows", None) # Print an unlimited number of rows by setting to None; default max_rows is 60 (truncated output shows min_rows, default 10)
pd.set_option("display.max_columns", None) # don't truncate columns to display all of them by setting to None
pd.set_option("display.width", None) # Auto-detect the width of DataFrame to display all columns in single line by setting to None
pd.set_option("display.max_colwidth", None) # Auto detect the maximum size of column and print contents of that column without truncation
# Reset to defaults if needed
# pd.reset_option("display.*")
Create Temporary Directory for Intermediate Files¶
Create a temp directory to store intermediate files.
Examples include:
- correlation.csv for reference
- clean_df.parquet for reloading in 02_model.ipynb during Optuna optimization, model training and reporting
# Create the directory path if it doesn't exist and raise no error if it already exists
os.makedirs("temp/Loan-level", exist_ok = True)
temp_dir = "temp/Loan-level"
Functions¶
def basic_overview_df(df, name = "data.csv"):
"""
Provide a quick overview of a given Pandas DataFrame.
Parameters
----------
df : pd.DataFrame
The DataFrame to analyze
name : str, optional
Name to display for the DataFrame (default: "data.csv")
Returns
-------
pd.DataFrame
The DataFrame with string columns trimmed
Description
-----------
- Strips leading/trailing spaces from string columns
- Reports duplicate entries
- Shows DataFrame shape
- Displays first 5 rows
- Provides basic information about columns
"""
# Trim leading and trailing spaces from string columns to ensure data consistency,
# especially when matching ID columns and preventing unintended discrepancies
for col in df.select_dtypes(include = ["object"]).columns:
df[col] = df[col].map(lambda x: x.strip() if isinstance(x, str) else x)
display(Markdown(f'<span style = "font-size: 18px; font-weight: bold;"><u>{name}</u></span>'))
print(f'- {df.duplicated().sum()} duplicate rows.')
print(f'- {df.shape[0]} entries and {df.shape[1]} columns.\n')
print(f'- First 5 entries:\n')
display(df.head()) # Use display() for better output in JupyterLab
print(f'\n- Data Information:\n')
df.info(verbose = True)
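The string-trimming step in basic_overview_df can be illustrated on a small toy frame (hypothetical values, not from the real data):

```python
import pandas as pd

# Toy frame with padded strings to illustrate the trimming step above
toy = pd.DataFrame({"loanId": ["  LL-1 ", "LL-2"], "amount": [500, 300]})

# Strip leading/trailing spaces from every object (string) column,
# leaving non-string values untouched
for col in toy.select_dtypes(include=["object"]).columns:
    toy[col] = toy[col].map(lambda x: x.strip() if isinstance(x, str) else x)

print(toy["loanId"].tolist())  # ['LL-1', 'LL-2']
```

This matters before joins: "  LL-1 " and "LL-1" would otherwise fail to match on the ID columns.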
def print_dup_ids(df, col, df_name):
"""
Check ID occurrences in a Pandas DataFrame column for proper data joins.
Useful for checking data quality before joins and identifying potential duplicate records.
Parameters
----------
df : pd.DataFrame
Input DataFrame to check for duplicates.
col : str
Name of the column to check for duplicate values.
df_name : str
Name to display in output messages.
"""
# Get matching IDs with >1 occurrence and their counts
dups = [(loanId, cnt) for loanId, cnt in Counter(df[col]).items() if cnt > 1]
# Sort duplicates by count in descending order
dups.sort(key = lambda x: x[1], reverse = True)
# Print the IDs and their counts
for loanId, cnt in dups:
print(f'{df_name}.{col}:{loanId}, Occurrences:{cnt}')
del dups
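A quick sketch of the same duplicate-ID logic on toy IDs (hypothetical values) shows what print_dup_ids reports:

```python
from collections import Counter

import pandas as pd

# Toy IDs to illustrate the duplicate check in print_dup_ids
toy = pd.DataFrame({"loanId": ["LL-1", "LL-2", "LL-1", "LL-3", "LL-1", "LL-2"]})

# Same logic: keep IDs occurring more than once, sorted by count (descending)
dups = [(i, c) for i, c in Counter(toy["loanId"]).items() if c > 1]
dups.sort(key=lambda x: x[1], reverse=True)

for loan_id, cnt in dups:
    print(f"toy.loanId:{loan_id}, Occurrences:{cnt}")
# LL-1 occurs 3 times and LL-2 twice; the unique LL-3 is not reported
```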
def anal_df(df):
"""
Analyze a pandas DataFrame for duplicates, shape, data types and missing values.
Provides a comprehensive overview of the DataFrame including:
- First few rows
- Duplicate row count
- Shape (rows and columns)
- Missing values analysis with proportions
- Data types for each column
Parameters
----------
df : pd.DataFrame
The DataFrame to analyze.
"""
display(Markdown(f'<span style = "font-size: 18px; font-weight: bold;"><u>DataFrame Overview</u></span>'))
print(f'- First 5 entries:')
display(df.head())
print(f'- {df.duplicated().sum()} duplicate rows.')
print(f'- {df.shape[0]} entries, {df.shape[1]} columns.')
# Get missing values and their proportion
missing_val = df.isnull().sum()
missing_prop = (missing_val / len(df)) * 100
# Get data types
dtype_series = df.dtypes
dtype_df = pd.DataFrame(dtype_series).reset_index()
dtype_df.columns = ["Column", "Dtype"]
# Combine missing values and data types into a single DataFrame
missing_df = pd.DataFrame({"Missing Values (n)": missing_val, "Proportion (%)": missing_prop})
# Include data types in the missing_df
missing_df = missing_df.join(dtype_df.set_index("Column"), on = missing_df.index)
# Drop the redundant 'key_0' column
missing_df = missing_df.drop(columns = "key_0", errors = "ignore")
# Sort the df by the number of missing values
missing_df = missing_df.sort_values(by = "Missing Values (n)", ascending = False)
# Print the results
print(f'- Check missing values and data types:')
print(missing_df.to_string())
# Clean up variables
del missing_val, missing_prop, dtype_df
def is_bool_nan_col(col):
"""
Check if a column contains only boolean values (True/False) and/or NaN values.
This function identifies columns that can be converted to the nullable boolean dtype for memory optimization and type consistency
in downstream processing.
Parameters
----------
col : pd.Series or np.ndarray
The column or array to check for boolean values.
"""
# Get unique values excluding NaN/null values
# dropna() removes all NaN/null values
# unique() returns array of unique values
# set() converts to set for efficient comparison
uniq_vals = set(col.dropna().unique())
# Check if unique values are subset of {True, False}
# <= operator for sets checks if left side is subset of right side
# Returns True if uniq_vals only contains True and/or False
# Returns True if uniq_vals is empty (all values were NaN)
return uniq_vals <= {True, False}
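A minimal sketch (toy columns, hypothetical values) of how this check can drive a conversion to pandas' nullable "boolean" dtype:

```python
import numpy as np
import pandas as pd

def is_bool_nan_col(col):
    # True when the non-null unique values are a subset of {True, False}
    return set(col.dropna().unique()) <= {True, False}

toy = pd.DataFrame({
    "flag": [True, np.nan, False],    # bool + NaN -> convertible
    "amount": [500.0, 300.0, np.nan]  # numeric -> left as float
})

# Convert qualifying columns; NaN becomes pd.NA in the nullable dtype
for c in toy.columns:
    if is_bool_nan_col(toy[c]):
        toy[c] = toy[c].astype("boolean")

print(toy.dtypes)  # flag becomes boolean, amount stays float64
```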
def boxplt_and_summary_stats(df, target_col, feat_col, title, y_min, y_max, step):
"""
Generate a boxplot and summary statistics for a feature grouped by binary target.
Creates a side-by-side boxplot comparing the distribution of a feature between two target groups (0: Safe, 1: Risky),
along with comprehensive summary statistics including range and IQR.
Parameters
----------
df : pd.DataFrame
Input DataFrame containing the data. Must include both target and feature columns specified.
target_col : str
Name of the binary target column used for grouping. Expected to contain values 0 (Safe) and 1 (Risky).
feat_col : str
Name of the feature column to analyze and plot. Can be numeric continuous or discrete data.
title : str
Title for the boxplot. Should be descriptive of the analysis being
performed.
y_min : int or float
Minimum value for y-axis range. Sets the lower bound of the plot.
y_max : int or float
Maximum value for y-axis range. Sets the upper bound of the plot.
step : int or float
Interval between y-axis tick marks. Determines tick spacing.
Notes
-----
- Rows with missing values in either target or feature columns are removed
- Boxplot includes:
* Notched boxes showing confidence interval of median
* Mean markers (white squares with cyan edges)
* Median lines (gold color)
* Color coding: green for Safe, red for Risky
- Summary statistics include: count, mean, std, min, 25%, 50%, 75%, max, range (max-min) and IQR (75%-25%)
- Memory is explicitly cleared after plotting to prevent memory leaks
"""
# Remove rows with missing vals (NaN) in target/feature cols as Matplotlib doesn't automatically handle them
df = df[[target_col, feat_col]].dropna()
# Prepare data for boxplot: separate feature values based on the target column
cols = [df[df[target_col] == 0][feat_col].tolist(),
df[df[target_col] == 1][feat_col].tolist()]
# Create figure and axis
fig,ax = plt.subplots(figsize = (7, 5))
# Boxplot
boxplt = ax.boxplot(cols,
notch = True, # Add notch for confidence interval of the median
patch_artist = True, # Enable color filling for boxes
showmeans = True, # Display mean marker
meanprops = {"marker": "s", "markerfacecolor": "white", "markeredgecolor": "cyan"},
medianprops = {"color": "gold"}) # Set the median line color to gold
# Set x-axis labels, y-axis label and title
ax.set_xticklabels(["Safe", "Risky"], size = 12)
ax.set_xlabel("Loans", size = 12)
ax.set_ylabel(feat_col, size = 12)
ax.set_title(title, size = 14)
# Add colors to boxes
colors = ["#99FF99", "#FF9999"]
for patch, color in zip(boxplt["boxes"], colors):
patch.set_facecolor(color)
# Add legend for median and mean
ax.legend([boxplt["medians"][0], boxplt["means"][0]], ["Median", "Mean"], loc = "upper right")
# Set y-axis limits and ticks using MaxNLocator
ax.set_ylim(y_min, y_max)
ax.yaxis.set_major_locator(MaxNLocator(integer = True, prune = "lower"))
ax.set_yticks(np.arange(y_min, y_max + 1, step)) # Set ticks at intervals of 'step'
plt.show()
# Summary statistics
summary_stats = df.groupby(target_col)[feat_col].describe(include = "all")
# Rename index for better readability
summary_stats.rename(index = {0: "Safe", 1: "Risky"}, inplace = True)
# Format the summary statistics
def fmt_stats(df):
df["range"] = df["max"] - df["min"]
df["IQR"] = df["75%"] - df["25%"]
# Define formatting for each column
fmts = {"count": "{:.0f}", "mean": "{:.3f}", "std": "{:.3f}", "min": "{:.3f}",
"25%": "{:.3f}", "50%": "{:.3f}", "75%": "{:.3f}", "max": "{:.3f}",
"range": "{:.3f}", "IQR": "{:.3f}"
}
# Apply column-wise formatting
for col, fmt in fmts.items():
if col in df.columns:
df[col] = df[col].apply(lambda x: fmt.format(x) if pd.notnull(x) else x)
return df
display(Markdown(f'**- Summary Statistics:**'))
display(fmt_stats(summary_stats))
del fig, ax, cols, boxplt, colors, patch, color
def plot_stacked_bar(clean_df, feature, observed = False, dropna = False, maxtickval = 13):
"""
Create an interactive stacked bar chart for categorical feature analysis by target groups.
Generates a stacked bar chart showing the distribution of a categorical feature across binary target groups (0: Safe, 1: Risky).
Each bar shows both the count and proportion of target groups within each category.
Includes a detailed summary statistics table.
Parameters
----------
clean_df : pd.DataFrame
Input DataFrame containing the categorical feature and a 'target' column.
The target column must contain binary values (0 and 1).
feature : str
Name of the categorical column to analyze. Can be any data type that can be converted to categorical (string, numeric, etc.).
NaN values are supported.
observed : bool, default = False
If True, only show observed values for categorical groupers, improving performance with high-cardinality categorical data.
If False, show all categorical values even if they have zero counts.
dropna : bool, default = False
If True, NaN/null values in the grouping columns are excluded from the result.
If False, NaN values are treated as a separate category labeled "NaN".
maxtickval : int, default = 13
Maximum value for y-axis tick marks in thousands.
For example, 13 creates ticks from 0 to 12,000. Must be a positive integer.
Notes
-----
- Color scheme: Green (#99FF99) for safe, Red (#FF9999) for risky
- Bars are sorted by total count in descending order for better visibility
- Proportions are displayed on each segment with 2 decimal precision
- NaN values are converted to string "NaN" for proper visualization
- The function includes memory cleanup to prevent memory leaks in notebooks
- Summary table shows both counts and proportions with proper formatting
Implementation Details
---------------------
The function performs the following steps:
1. Groups data by feature and target to calculate counts
2. Calculates proportions within each feature category
3. Converts data types for proper visualization
4. Handles NaN values by converting them to a visible category
5. Creates an interactive stacked bar chart with Plotly
6. Generates a pivoted summary statistics table
7. Cleans up memory after execution
"""
# Groupby with observed parameter
df = clean_df.groupby([feature, "target"], observed = observed, dropna = dropna).size().reset_index()
# Calculate percentages
df["percentage"] = (clean_df.groupby([feature, "target"],observed = observed, dropna = dropna)
.size()
.groupby(level = 0, observed = observed, dropna = dropna)
.apply(lambda x: 100 * x / float(x.sum()))
.values
)
# Create a dictionary that maps the variable names to the desired data types
vars_typ = {feature: "category", "target": "string"}
df = df.astype(vars_typ)
# Rename columns
df.columns = [feature, "target", "Counts", "Proportion (%)"]
# Add NaN category and fill missing values
df[feature] = df[feature].cat.add_categories("NaN")
df[feature] = df[feature].fillna("NaN")
# Sort by Counts in descending order
df.sort_values(by = "Counts", ascending = False, inplace = True)
# Create the bar plot
fig = px.bar(df,
x = feature,
y = ["Counts"],
color = "target",
text = df["Proportion (%)"].apply(lambda x: "{0:1.2f}%".format(x)),
color_discrete_map = {"0": "#99FF99", "1": "#FF9999"},
category_orders = {"target": ["0", "1"]},
)
# Update layout
fig = fig.update_layout(height = 500,
width = 1000,
title_x = 0.5,
barmode = "stack",
legend = dict(yanchor = "top",
y = 0.98,
xanchor = "right",
x = 0.99,
title_text = "Loans",
title_font = dict(size = 14),
itemsizing = "constant",
traceorder = "reversed" # Reverse the order of legend items
),
)
"""
#tickvals: Contains both low-range (0-100 in steps of 10) and high-range (1000 - 24000 in steps of 1000) values.
tickvals = [i * 10 for i in range(11)] + [i * 1000 for i in range(1, 25)]
ticktext = [str(i) for i in range(0, 101, 10)] + [str(i) for i in range(1000 , 25000, 1000)]
""";
# Update y-axis with dynamic tickvals and ticktext
fig = fig.update_yaxes(title_text = "Count (in 1,000)",
tickvals = [i * 1000 for i in range(0, maxtickval)], # Tick positions at 0, 1000, ..., (maxtickval - 1) * 1000
ticktext = [str(i) for i in range(0, maxtickval)] # Tick labels as 0, 1, ..., maxtickval - 1
)
# Update legend values
fig = fig.for_each_trace(lambda t: t.update(name = "Safe" if t.name == "0" else "Risky"))
# Show the figure
fig.show()
display(Markdown(f'**- Summary Statistics:**'))
#display(df.sort_values(by = [feature, "target"], ascending = [True, False]).reset_index(drop = True))
display(df.sort_values(by = [feature, "target"], ascending = [True, False]) # Sort by feature first, then by target (descending)
.pivot(index = feature, columns = "target", values = ["Counts", "Proportion (%)"]) # Reshape
.rename(columns = {"0": "Safe", "1": "Risky"}, level = 1)
.assign(**{"Counts": lambda x: x["Counts"].astype(int), # Convert Counts to integer
"Proportion (%)": lambda x: x["Proportion (%)"].round(3)}) # Round Proportion to 3 decimal places
.assign(total_count = lambda x: x["Counts"].sum(axis = 1)) # Compute total count per feature
.sort_values(by = "total_count", ascending = False) # Sort by total count in descending order
.drop(columns = "total_count") # Remove the temporary total_count column after sorting
.swaplevel(axis = 1) # Swap multi-index levels for better readability
.sort_index(axis = 1) # Sort columns properly
)
del fig, df, vars_typ, maxtickval
Import csv data files¶
- clarity_underwriting_variables.csv
- loan.csv
- payment.csv
"""
# Print current working directory
print("Current working directory:", os.getcwd())
""";
# Load CSV files into pandas dfs
cuv_df = pd.read_csv("./data/data/clarity_underwriting_variables.csv",
low_memory = False) # Ensure accurate data types for all columns despite the cost of increased memory usage
loan_df = pd.read_csv("./data/data/loan.csv",
parse_dates = ["applicationDate", "originatedDate"],
date_format = "ISO8601") # Up to millisecond precision -> yyyy-mm-dd hh:mm:ss.sss
payment_df = pd.read_csv("./data/data/payment.csv",
parse_dates = ["paymentDate"],
date_format = "ISO8601") # Up to millisecond precision -> yyyy-mm-dd hh:mm:ss.sss
Data overview¶
loan.csv¶
basic_overview_df(loan_df, name = "loan.csv")
loan.csv
- 0 duplicate rows.
- 577682 entries and 19 columns.
- First 5 entries:
| loanId | anon_ssn | payFrequency | apr | applicationDate | originated | originatedDate | nPaidOff | approved | isFunded | loanStatus | loanAmount | originallyScheduledPaymentAmount | state | leadType | leadCost | fpStatus | clarityFraudId | hasCF | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | LL-I-07399092 | beff4989be82aab4a5b47679216942fd | B | 360.0 | 2016-02-23 17:29:01.940 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 500.0 | 978.27 | IL | bvMandatory | 6 | NaN | 5669ef78e4b0c9d3936440e6 | 1 |
| 1 | LL-I-06644937 | 464f5d9ae4fa09ece4048d949191865c | B | 199.0 | 2016-01-19 22:07:36.778 | True | 2016-01-20 15:49:18.846 | 0.0 | True | 1 | Paid Off Loan | 3000.0 | 6395.19 | CA | prescreen | 0 | Checked | 569eb3a3e4b096699f685d64 | 1 |
| 2 | LL-I-10707532 | 3c174ae9e2505a5f9ddbff9843281845 | B | 590.0 | 2016-08-01 13:51:14.709 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 400.0 | 1199.45 | MO | bvMandatory | 3 | NaN | 579eab11e4b0d0502870ef2f | 1 |
| 3 | LL-I-02272596 | 9be6f443bb97db7e95fa0c281d34da91 | B | 360.0 | 2015-08-06 23:58:08.880 | False | NaT | 0.0 | False | 0 | Withdrawn Application | 500.0 | 1074.05 | IL | bvMandatory | 3 | NaN | 555b1e95e4b0f6f11b267c18 | 1 |
| 4 | LL-I-09542882 | 63b5494f60b5c19c827c7b068443752c | B | 590.0 | 2016-06-05 22:31:34.304 | False | NaT | 0.0 | False | 0 | Rejected | 350.0 | 814.37 | NV | bvMandatory | 3 | NaN | 5754a91be4b0c6a2bf424772 | 1 |
- Data Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 577682 entries, 0 to 577681 Data columns (total 19 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loanId 577426 non-null object 1 anon_ssn 577682 non-null object 2 payFrequency 576409 non-null object 3 apr 573760 non-null float64 4 applicationDate 577682 non-null datetime64[ns] 5 originated 577682 non-null bool 6 originatedDate 46044 non-null datetime64[ns] 7 nPaidOff 577658 non-null float64 8 approved 577682 non-null bool 9 isFunded 577682 non-null int64 10 loanStatus 577291 non-null object 11 loanAmount 575432 non-null float64 12 originallyScheduledPaymentAmount 577682 non-null float64 13 state 577550 non-null object 14 leadType 577682 non-null object 15 leadCost 577682 non-null int64 16 fpStatus 51723 non-null object 17 clarityFraudId 357693 non-null object 18 hasCF 577682 non-null int64 dtypes: bool(2), datetime64[ns](2), float64(4), int64(3), object(8) memory usage: 76.0+ MB
# Check:
#loan_df[loan_df["loanId"]=="LL-I-18226935"] #yyyy-mm-dd hh:mm:ss
payment.csv¶
basic_overview_df(payment_df, name = "payment.csv")
payment.csv
- 0 duplicate rows.
- 689364 entries and 9 columns.
- First 5 entries:
| loanId | installmentIndex | isCollection | paymentDate | principal | fees | paymentAmount | paymentStatus | paymentReturnCode | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | LL-I-00000021 | 1 | False | 2014-12-19 05:00:00 | 22.33 | 147.28 | 169.61 | Checked | NaN |
| 1 | LL-I-00000021 | 2 | False | 2015-01-02 05:00:00 | 26.44 | 143.17 | 169.61 | Checked | NaN |
| 2 | LL-I-00000021 | 3 | False | 2015-01-16 05:00:00 | 31.30 | 138.31 | 169.61 | Checked | NaN |
| 3 | LL-I-00000021 | 4 | False | 2015-01-30 05:00:00 | 37.07 | 132.54 | 169.61 | Checked | NaN |
| 4 | LL-I-00000021 | 5 | False | 2015-02-13 05:00:00 | 43.89 | 125.72 | 169.61 | Checked | NaN |
- Data Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 689364 entries, 0 to 689363 Data columns (total 9 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 loanId 689364 non-null object 1 installmentIndex 689364 non-null int64 2 isCollection 689364 non-null bool 3 paymentDate 689364 non-null datetime64[ns] 4 principal 689364 non-null float64 5 fees 689364 non-null float64 6 paymentAmount 689364 non-null float64 7 paymentStatus 525307 non-null object 8 paymentReturnCode 31533 non-null object dtypes: bool(1), datetime64[ns](1), float64(3), int64(1), object(3) memory usage: 42.7+ MB
clarity_underwriting_variables.csv¶
# Shorten long original column names
# Prefixes to convert to cfinq., cfind., cfindvrfy.
prefix_map = {".underwritingdataclarity.clearfraud.clearfraudinquiry.": "cfinq.",
".underwritingdataclarity.clearfraud.clearfraudindicator.": "cfind.",
".underwritingdataclarity.clearfraud.clearfraudidentityverification.": "cfindvrfy."
}
cuv_df.rename(columns = lambda col:
# Replace only the first occurrence of each prefix if column starts with prefix
next((col.replace(orig, new, 1) for orig, new in prefix_map.items() if col.startswith(orig)),
col # If no prefix matches, keep the column name unchanged
),
inplace = True)
basic_overview_df(cuv_df, name = "clarity_underwriting_variables.csv")
del prefix_map
clarity_underwriting_variables.csv
- 0 duplicate rows.
- 49752 entries and 54 columns.
- First 5 entries:
| cfinq.thirtydaysago | cfinq.twentyfourhoursago | cfinq.oneminuteago | cfinq.onehourago | cfinq.ninetydaysago | cfinq.sevendaysago | cfinq.tenminutesago | cfinq.fifteendaysago | cfinq.threesixtyfivedaysago | cfind.inquiryonfilecurrentaddressconflict | cfind.totalnumberoffraudindicators | cfind.telephonenumberinconsistentwithaddress | cfind.inquiryageyoungerthanssnissuedate | cfind.onfileaddresscautious | cfind.inquiryaddressnonresidential | cfind.onfileaddresshighrisk | cfind.ssnreportedmorefrequentlyforanother | cfind.currentaddressreportedbytradeopenlt90days | cfind.inputssninvalid | cfind.inputssnissuedatecannotbeverified | cfind.inquiryaddresscautious | cfind.morethan3inquiriesinthelast30days | cfind.onfileaddressnonresidential | cfind.creditestablishedpriortossnissuedate | cfind.driverlicenseformatinvalid | cfind.inputssnrecordedasdeceased | cfind.inquiryaddresshighrisk | cfind.inquirycurrentaddressnotonfile | cfind.bestonfilessnissuedatecannotbeverified | cfind.highprobabilityssnbelongstoanother | cfind.maxnumberofssnswithanybankaccount | cfind.bestonfilessnrecordedasdeceased | cfind.currentaddressreportedbynewtradeonly | cfind.creditestablishedbeforeage18 | cfind.telephonenumberinconsistentwithstate | cfind.driverlicenseinconsistentwithonfile | cfind.workphonepreviouslylistedascellphone | cfind.workphonepreviouslylistedashomephone | cfindvrfy.ssnnamematch | cfindvrfy.nameaddressmatch | cfindvrfy.phonematchtype | cfindvrfy.ssnnamereasoncodedescription | cfindvrfy.phonematchresult | cfindvrfy.nameaddressreasoncodedescription | cfindvrfy.phonematchtypedescription | cfindvrfy.overallmatchresult | cfindvrfy.phonetype | cfindvrfy.ssndobreasoncode | cfindvrfy.ssnnamereasoncode | cfindvrfy.nameaddressreasoncode | cfindvrfy.ssndobmatch | cfindvrfy.overallmatchreasoncode | clearfraudscore | underwritingid | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8.0 | 2.0 | 2.0 | 2.0 | 8.0 | 2.0 | 2.0 | 5.0 | 10.0 | False | 2.0 | True | False | False | True | False | False | False | False | False | False | False | False | False | NaN | False | False | False | False | False | 1.0 | False | False | False | False | NaN | False | False | match | partial | M | NaN | unavailable | (A8) Match to Last Name only | (M) Mobile Phone | partial | NaN | NaN | NaN | A8 | match | 6.0 | 871.0 | 54cbffcee4b0ba763e43144d |
| 1 | 5.0 | 2.0 | 2.0 | 2.0 | 11.0 | 2.0 | 2.0 | 4.0 | 21.0 | True | 3.0 | True | False | False | False | False | False | False | False | False | False | False | False | False | NaN | False | False | True | False | False | 1.0 | False | False | False | False | NaN | False | False | match | mismatch | M | NaN | unavailable | NaN | (M) Mobile Phone | partial | NaN | NaN | NaN | NaN | match | 11.0 | 397.0 | 54cc0408e4b0418d9a7f78af |
| 2 | 9.0 | 4.0 | 2.0 | 3.0 | 10.0 | 8.0 | 2.0 | 9.0 | 25.0 | False | 3.0 | True | False | False | False | False | False | False | False | False | False | False | False | False | NaN | False | False | False | False | False | 2.0 | False | False | False | False | NaN | True | False | match | match | M | NaN | unavailable | NaN | (M) Mobile Phone | match | NaN | NaN | NaN | NaN | match | 1.0 | 572.0 | 54cc0683e4b0418d9a80adb6 |
| 3 | 3.0 | 2.0 | 2.0 | 2.0 | 9.0 | 2.0 | 2.0 | 2.0 | 9.0 | False | 1.0 | True | False | False | False | False | False | False | False | False | False | False | False | False | NaN | False | False | False | False | False | 1.0 | False | False | False | False | NaN | False | False | match | mismatch | M | NaN | unavailable | NaN | (M) Mobile Phone | partial | NaN | NaN | NaN | NaN | match | 11.0 | 838.0 | 54cc0780e4b0ba763e43b74a |
| 4 | 5.0 | 5.0 | 2.0 | 2.0 | 6.0 | 5.0 | 2.0 | 5.0 | 6.0 | False | 1.0 | True | False | False | False | False | False | False | False | False | False | False | False | False | NaN | False | False | False | False | False | 1.0 | False | False | False | False | NaN | False | False | match | match | M | NaN | unavailable | NaN | (M) Mobile Phone | match | NaN | NaN | NaN | NaN | match | 1.0 | 768.0 | 54cc1d67e4b0ba763e445b45 |
- Data Information: <class 'pandas.core.frame.DataFrame'> RangeIndex: 49752 entries, 0 to 49751 Data columns (total 54 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 cfinq.thirtydaysago 49750 non-null float64 1 cfinq.twentyfourhoursago 49750 non-null float64 2 cfinq.oneminuteago 49750 non-null float64 3 cfinq.onehourago 49750 non-null float64 4 cfinq.ninetydaysago 49750 non-null float64 5 cfinq.sevendaysago 49750 non-null float64 6 cfinq.tenminutesago 49750 non-null float64 7 cfinq.fifteendaysago 49750 non-null float64 8 cfinq.threesixtyfivedaysago 49750 non-null float64 9 cfind.inquiryonfilecurrentaddressconflict 49712 non-null object 10 cfind.totalnumberoffraudindicators 49735 non-null float64 11 cfind.telephonenumberinconsistentwithaddress 49712 non-null object 12 cfind.inquiryageyoungerthanssnissuedate 49712 non-null object 13 cfind.onfileaddresscautious 49712 non-null object 14 cfind.inquiryaddressnonresidential 49712 non-null object 15 cfind.onfileaddresshighrisk 49712 non-null object 16 cfind.ssnreportedmorefrequentlyforanother 49712 non-null object 17 cfind.currentaddressreportedbytradeopenlt90days 49712 non-null object 18 cfind.inputssninvalid 49712 non-null object 19 cfind.inputssnissuedatecannotbeverified 49712 non-null object 20 cfind.inquiryaddresscautious 49712 non-null object 21 cfind.morethan3inquiriesinthelast30days 49712 non-null object 22 cfind.onfileaddressnonresidential 49712 non-null object 23 cfind.creditestablishedpriortossnissuedate 49712 non-null object 24 cfind.driverlicenseformatinvalid 44703 non-null object 25 cfind.inputssnrecordedasdeceased 49712 non-null object 26 cfind.inquiryaddresshighrisk 49712 non-null object 27 cfind.inquirycurrentaddressnotonfile 49712 non-null object 28 cfind.bestonfilessnissuedatecannotbeverified 49712 non-null object 29 cfind.highprobabilityssnbelongstoanother 49712 non-null object 30 cfind.maxnumberofssnswithanybankaccount 49735 non-null float64 31 
cfind.bestonfilessnrecordedasdeceased 49712 non-null object 32 cfind.currentaddressreportedbynewtradeonly 49712 non-null object 33 cfind.creditestablishedbeforeage18 49712 non-null object 34 cfind.telephonenumberinconsistentwithstate 49071 non-null object 35 cfind.driverlicenseinconsistentwithonfile 10055 non-null object 36 cfind.workphonepreviouslylistedascellphone 21416 non-null object 37 cfind.workphonepreviouslylistedashomephone 21416 non-null object 38 cfindvrfy.ssnnamematch 49720 non-null object 39 cfindvrfy.nameaddressmatch 49720 non-null object 40 cfindvrfy.phonematchtype 48799 non-null object 41 cfindvrfy.ssnnamereasoncodedescription 2669 non-null object 42 cfindvrfy.phonematchresult 49712 non-null object 43 cfindvrfy.nameaddressreasoncodedescription 5627 non-null object 44 cfindvrfy.phonematchtypedescription 48799 non-null object 45 cfindvrfy.overallmatchresult 49720 non-null object 46 cfindvrfy.phonetype 1515 non-null object 47 cfindvrfy.ssndobreasoncode 9029 non-null object 48 cfindvrfy.ssnnamereasoncode 2669 non-null object 49 cfindvrfy.nameaddressreasoncode 5627 non-null object 50 cfindvrfy.ssndobmatch 49720 non-null object 51 cfindvrfy.overallmatchreasoncode 49720 non-null float64 52 clearfraudscore 49615 non-null float64 53 underwritingid 49752 non-null object dtypes: float64(13), object(41) memory usage: 20.5+ MB
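The prefix-shortening rename applied to cuv_df above can be sanity-checked on a single toy column name (the logic below mirrors the lambda passed to rename):

```python
# Subset of the prefix map used for the rename above
prefix_map = {
    ".underwritingdataclarity.clearfraud.clearfraudinquiry.": "cfinq.",
    ".underwritingdataclarity.clearfraud.clearfraudindicator.": "cfind.",
}

def shorten(col):
    # Replace only the first occurrence of a matching prefix;
    # columns without a matching prefix are returned unchanged
    return next((col.replace(orig, new, 1)
                 for orig, new in prefix_map.items() if col.startswith(orig)),
                col)

print(shorten(".underwritingdataclarity.clearfraud.clearfraudinquiry.thirtydaysago"))
print(shorten("underwritingid"))  # no prefix match -> unchanged
```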
Data merging/combination¶
- Including data preprocessing/transformation as appropriate
# Check:
# Every row represents a unique underwriting case i.e. underwritingid
#print_dup_ids(cuv_df, "underwritingid", "cuv_df")
# A unique underwriting case i.e. clarityFraudId may involve multiple loans i.e. loanId
#print_dup_ids(loan_df, "clarityFraudId", "loan_df")
#print_dup_ids(loan_df,"loanId", "loan_df")
"""
The maximum number of loans for underwriting is 15 and the minimum is 2
""";
#print_dup_ids(payment_df, "loanId", "payment_df")
"""
Maximum number of payment entries = 105, minimum is 3
""";
# Separate rows from cuv_df: with or without underwritingid to avoid incorrect merging/combination
cuv_w_id = cuv_df[cuv_df["underwritingid"].notnull()]
cuv_wo_id = cuv_df[cuv_df["underwritingid"].isnull()]
# Separate rows from loan_df: with or without clarityFraudId to avoid incorrect merging/combination
loan_w_id = loan_df[loan_df["clarityFraudId"].notnull()]
loan_wo_id = loan_df[loan_df["clarityFraudId"].isnull()]
dfs = {"w/ cuv_df.underwritingid": cuv_w_id,
"w/o cuv_df.underwritingid": cuv_wo_id,
"w/ loan_df.clarityFraudId": loan_w_id,
"w/o loan_df.clarityFraudId": loan_wo_id}
for name, df in dfs.items():
if name == "w/ loan_df.clarityFraudId":
#uniq_cnt = df["clarityFraudId"].nunique() # Replace clarityFraudId with the actual ID column name
display(Markdown(f'**- {name}: {df.shape[0]} rows with total unique *clarityFraudId* of {df["clarityFraudId"].nunique()}**'))
#print(f'{start}{name}: {df.shape[0]} rows with total unique clarityFraudId of {df["clarityFraudId"].nunique()}{end}')
else:
display(Markdown(f'**- {name}: {df.shape[0]} rows**'))
#print(f'{start}{name}: {df.shape[0]} rows{end}')
del dfs, name, df
- w/ cuv_df.underwritingid: 49752 rows
- w/o cuv_df.underwritingid: 0 rows
- w/ loan_df.clarityFraudId: 357693 rows
- w/o loan_df.clarityFraudId: 219989 rows
# Retrieve rows with non-missing underwritingid and clarityFraudId from both dataframes, either matching or non-matching
merged_df = pd.merge(cuv_w_id,
loan_w_id,
left_on = "underwritingid",
right_on = "clarityFraudId",
how = "outer",
indicator = True)
# Recode merging indicator
merged_df = merged_df.rename(columns = {"_merge": "cuv_loan_ind"})
merged_df["cuv_loan_ind"] = merged_df["cuv_loan_ind"].cat.rename_categories({"left_only": "in_cuv",
                                                                             "right_only": "in_loan",
                                                                             "both": "in_cuv_loan"})
display(Markdown(f'<span style = "font-size: 18px; font-weight: bold;"><u>Initial merge</u></span>'))
merged_df.shape
merged_df.cuv_loan_ind.value_counts(dropna = False)
Initial merge
(375697, 74)
cuv_loan_ind in_loan 321359 in_cuv_loan 36334 in_cuv 18004 Name: count, dtype: int64
# Append rows from loan data with no clarityFraudId (loan_wo_id) to merged_df,
# which holds all rows with an available underwritingid or clarityFraudId, matching or not
# Assign merging indicator
loan_wo_id = loan_wo_id.copy() # To avoid SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame.
loan_wo_id.loc[:,"cuv_loan_ind"] = "in_loan"
# Combine other df with unavailable IDs containing >0 rows
cuv_loan_df = pd.concat([merged_df, loan_wo_id], ignore_index = True)
display(Markdown(f'<span style = "font-size: 18px; font-weight: bold;"><u>Final merge</u></span>'))
display(Markdown(f'**- cuv_loan_df {cuv_loan_df.shape} vs. cuv_df {cuv_df.shape} and loan_df {loan_df.shape}**'))
# print(start + "cuv_loan_df", cuv_loan_df.shape,"vs. cuv_df", cuv_df.shape, "and loan_df", loan_df.shape, end)
display(cuv_loan_df.cuv_loan_ind.value_counts(dropna = False))
del merged_df, cuv_w_id, cuv_wo_id, loan_w_id, loan_wo_id
Final merge
- cuv_loan_df (595686, 74) vs. cuv_df (49752, 54) and loan_df (577682, 19)
cuv_loan_ind in_loan 541348 in_cuv_loan 36334 in_cuv 18004 Name: count, dtype: int64
The final merge between underwriting and loan data with either matching or non-matching ID i.e. cuv_df.underwritingid and loan_df.clarityFraudId contains a total of 595686 rows and 74 (54 + 19 + 1 merge indicator = 74) columns.
- 541348 rows come solely from loan data,
- 36334 rows come from both underwriting and loan data (matching IDs found) and,
- 18004 rows come solely from underwriting data.
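The row and column arithmetic of the outer merge can be sanity-checked on toy data (made-up IDs, not the real files): an outer merge keeps every column from both frames plus the `indicator` column, and every ID from either side contributes a row.

```python
import pandas as pd

# Toy frames standing in for cuv_w_id and loan_w_id (hypothetical IDs/values)
cuv_toy = pd.DataFrame({"underwritingid": ["u1", "u2"], "score": [0.1, 0.9]})
loan_toy = pd.DataFrame({"clarityFraudId": ["u2", "u3"], "amount": [500, 900]})

toy_merge = pd.merge(cuv_toy, loan_toy,
                     left_on = "underwritingid",
                     right_on = "clarityFraudId",
                     how = "outer", indicator = True)

# 2 + 2 + 1 indicator = 5 columns; u1 is left_only, u2 both, u3 right_only
print(toy_merge.shape)
print(toy_merge["_merge"].value_counts())
```

The same logic scales to the real frames: 54 + 19 + 1 = 74 columns.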
# According to MoneyLion Data Scientist Assessment Data Dictionary.docx,
# availability of the clarity variables depends on the underwriting flow for the lead.
pd.crosstab(cuv_loan_df["cuv_loan_ind"], cuv_loan_df["leadType"], dropna = False, margins = True)
| leadType | bvMandatory | california | express | instant-offer | lead | lionpay | organic | prescreen | rc_returning | repeat | NaN | All |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| cuv_loan_ind | ||||||||||||
| in_cuv | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 18004 | 18004 |
| in_cuv_loan | 15799 | 58 | 1 | 12 | 12075 | 2 | 6836 | 1403 | 147 | 1 | 0 | 36334 |
| in_loan | 459202 | 421 | 21 | 10 | 60598 | 24 | 16015 | 3112 | 1922 | 23 | 0 | 541348 |
| All | 475001 | 479 | 22 | 22 | 72673 | 26 | 22851 | 4515 | 2069 | 24 | 0 | 595686 |
18004 row entries are present only in the underwriting data (cuv_df.underwritingid present but no loan_df.clarityFraudId).
Add payment_df¶
Each row in this file represents an ACH attempt, either scheduled for the future or already executed in the past, associated with a loanId.
- Feature engineering by aggregating data at the loan level.
- Transforming the dataset from a long to a wide format.
- Assume a null in paymentStatus or paymentReturnCode means no occurrence at the time of data extraction; fill the corresponding columns with zero after reshaping the data.
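The long-to-wide reshape described in these bullets can be sketched on toy data (hypothetical loanId and status values) with `value_counts().unstack()`, which also handles the zero-fill for statuses a loan never had:

```python
import pandas as pd

# Toy long-format payment data: one row per ACH attempt
toy = pd.DataFrame({
    "loanId": ["A", "A", "A", "B"],
    "paymentStatus": ["Checked", "Checked", "Rejected", "Checked"],
})

# Long -> wide: one row per loanId, one count column per status, zeros filled
wide = (toy.groupby("loanId")["paymentStatus"]
           .value_counts()
           .unstack(fill_value = 0)
           .add_prefix("cnt_")
           .reset_index())
```

Loan B never had a Rejected attempt, so `cnt_Rejected` comes out as 0 rather than NaN.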
# Replace NaN with string 'NaN' and convert to string
payment_df_copy = payment_df.copy()
payment_df_copy["paymentReturnCode"] = payment_df_copy["paymentReturnCode"].fillna("NaN").astype(str)
payment_df_copy["paymentStatus"] = payment_df_copy["paymentStatus"].fillna("NaN").astype(str)
# Get unique values for ordering
return_codes = sorted(payment_df_copy["paymentReturnCode"].unique())
statuses = sorted(payment_df_copy["paymentStatus"].unique())
collections = payment_df["isCollection"].unique()
# Create all possible combinations to show zero-count bubbles
all_comb = pd.DataFrame(list(product(return_codes, statuses, collections)),
columns = ["paymentReturnCode", "paymentStatus", "isCollection"])
# Count and merge
bubble_data = (payment_df_copy.groupby(["paymentReturnCode", "paymentStatus", "isCollection"])
.size()
.reset_index(name = "count")
.merge(all_comb, how = "right")
.fillna(0))
bubble_data["count"] = bubble_data["count"].astype(int)
# Create the plot
fig = px.scatter(bubble_data,
x = "paymentReturnCode",
y = "paymentStatus",
size = "count",
color = "count",
facet_col = "isCollection",
text = "count",
color_continuous_scale = "Viridis",
size_max = 60, # Controls the size of the largest bubble
# This ensures all categories appear on the axes in every facet
category_orders = {"paymentReturnCode": return_codes, "paymentStatus": statuses}
)
# Center the title using a structured dictionary
fig.update_layout(title = dict(text = "<b>Bubble Plot of Payment Status vs Payment Return Code by Collection Plan</b>",
x = 0.5,
xanchor = "center"),
height = 600
)
# Update traces for final styling (text size, opacity, etc.)
fig.update_traces(textposition = "middle center",
textfont_size = 10,
marker = dict(sizemin = 5,
opacity = 0.3, # Opacity for clarity
line = dict(width = 1, color = "DarkSlateGrey") # Add border to bubbles
)
)
# Simple loop to format the facet titles with spaces around "="
fig.for_each_annotation(lambda a: a.update(text=a.text.replace('=', ' = ')))
fig.show();
del payment_df_copy, return_codes, statuses, collections, all_comb, bubble_data;
# Print unlimited number of rows by setting to None, default is 10
pd.set_option("display.max_rows", None)
pd.crosstab([payment_df["paymentReturnCode"], payment_df["paymentStatus"]],
payment_df["isCollection"],
dropna = False)
# Reset to default setting
pd.reset_option("display.max_rows")
| isCollection | False | True | |
|---|---|---|---|
| paymentReturnCode | paymentStatus | ||
| C01 | Cancelled | 0 | 0 |
| Checked | 87 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 0 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| C02 | Cancelled | 0 | 0 |
| Checked | 10 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 0 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| C03 | Cancelled | 0 | 0 |
| Checked | 34 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 0 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| C05 | Cancelled | 0 | 0 |
| Checked | 106 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 0 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| C07 | Cancelled | 0 | 0 |
| Checked | 2 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 0 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| LPP01 | Cancelled | 0 | 0 |
| Checked | 1 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 6 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| MISSED | Cancelled | 0 | 0 |
| Checked | 1 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 536 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R01 | Cancelled | 1 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 22865 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R02 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 2761 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R03 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 318 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R04 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 39 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R06 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 6 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R07 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 159 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 1 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R08 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 2259 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R09 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 176 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R10 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 620 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R13 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 2 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R15 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 3 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R16 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 1085 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R19 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 1 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R20 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 83 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R29 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 4 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| R99 | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 60 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| RAF | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 58 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| RBW | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 5 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| RFG | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 3 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| RIR | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 1 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| RUP | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 6 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| RWC | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 7 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| RXL | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 1 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| RXS | Cancelled | 0 | 0 |
| Checked | 0 | 0 | |
| Complete | 0 | 0 | |
| Pending | 0 | 0 | |
| Rejected | 226 | 0 | |
| Rejected Awaiting Retry | 0 | 0 | |
| Returned | 0 | 0 | |
| Skipped | 0 | 0 | |
| NaN | 0 | 0 | |
| NaN | Cancelled | 264654 | 5679 |
| Checked | 203363 | 6017 | |
| Complete | 1 | 0 | |
| Pending | 9181 | 60 | |
| Rejected | 24 | 1016 | |
| Rejected Awaiting Retry | 0 | 18 | |
| Returned | 0 | 0 | |
| Skipped | 3761 | 0 | |
| NaN | 162952 | 1105 |
#Check:
#print("payment_df.paymentReturnCode:",list(sorted(payment_df.paymentReturnCode.astype(str).unique())))
#print("payment_df.paymentStatus:", list(sorted(payment_df.paymentStatus.astype(str).unique())))
#pd.crosstab(payment_df["isCollection"], payment_df["paymentReturnCode"].isna(), margins = True, normalize = False).rename(columns = {False: "notna", True: "isna"})
According to the bubble plot and contingency table above, C01, C02, C03, C05 and C07 in the payment data appear to be linked to successful payments, as indicated by the Checked status in paymentStatus documented in the data dictionary. These codes correspond to Notification of Change (NOC) codes in the ACH system.
For more details, refer to VeriCheck's ACH Notification of Change (NOC) Codes.
ACH Return Codes (R01 – R33) are associated with Rejected payments. However, there are 1,040 Rejected payments without a paymentReturnCode, which may indicate those entries had not yet been updated at the time of data extraction. Similarly, codes such as RAF, RBW, RFG, RIR, RUP, RWC, RXL and RXS are not documented anywhere and could possibly be custom codes.
For more details, refer to ACH Return Codes (R01 – R33).
Others like LPP01 and MISSED are not documented anywhere either. However, both appear against entries in the paymentStatus column, with LPP01 recorded as both successful (Checked, n = 1) and unsuccessful (Rejected, n = 6), and MISSED similarly recorded as successful (Checked, n = 1) and unsuccessful (Rejected, n = 536).
All ACH error-coded payment entries (n = 31,533) are associated only with the non-custom collection plan. In contrast, the custom collection plan (used when a customer has trouble making repayments on the original schedule) has no paymentReturnCode at all (n = 13,895).
Successful payments are indicated by Checked and Complete for paymentStatus.
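The observations above suggest a rough grouping of paymentReturnCode values into families. A hedged sketch (the family labels `noc`, `ach_return`, `custom` and `none` are my own, not from the data dictionary):

```python
import pandas as pd

def code_family(code):
    """Rough classification of a paymentReturnCode, per the EDA observations."""
    if pd.isna(code):
        return "none"        # no ACH attempt made yet, or code not updated
    if code.startswith("C") and code[1:].isdigit():
        return "noc"         # Notification of Change codes (C01-C07)
    if code.startswith("R") and code[1:].isdigit():
        return "ach_return"  # ACH return codes (R01-R33, plus R99)
    return "custom"          # undocumented codes: RAF, RBW, LPP01, MISSED, ...

codes = pd.Series(["C01", "R01", "RAF", "MISSED", None])
families = codes.map(code_family)
```

Such a derived feature could complement the raw code counts, but the grouping should be confirmed with a SME before use.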
Last payment status¶
- In comparison with loan_df.loanStatus
# Check:
# Filter rows with the most recent paymentDate by loanId
lpymtstatus_df = payment_df[payment_df["paymentDate"] == payment_df.groupby(["loanId"])['paymentDate'].transform("max")].rename(columns={"paymentStatus": "lpymtstatus"})
df_merged = cuv_loan_df[["loanId", "applicationDate", "originated", "approved", "loanAmount", "isFunded", "loanStatus"]].merge(lpymtstatus_df, on = "loanId", how = "inner")
pd.crosstab(df_merged["loanStatus"], df_merged["lpymtstatus"], dropna = False, margins = True)
del lpymtstatus_df, df_merged
| lpymtstatus | Cancelled | Checked | Pending | Rejected | Rejected Awaiting Retry | Skipped | NaN | All |
|---|---|---|---|---|---|---|---|---|
| loanStatus | ||||||||
| CSR Voided New Loan | 9 | 1 | 0 | 0 | 0 | 0 | 17 | 27 |
| Charged Off | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| Charged Off Paid Off | 176 | 2 | 0 | 3 | 0 | 0 | 9 | 190 |
| Credit Return Void | 659 | 0 | 0 | 0 | 0 | 0 | 42 | 701 |
| Customer Voided New Loan | 331 | 0 | 0 | 0 | 0 | 0 | 5 | 336 |
| Customver Voided New Loan | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| External Collection | 10896 | 275 | 0 | 200 | 3 | 0 | 2151 | 13525 |
| Internal Collection | 3075 | 31 | 0 | 20 | 0 | 0 | 2450 | 5576 |
| New Loan | 36 | 7 | 1 | 0 | 0 | 0 | 7998 | 8042 |
| Paid Off Loan | 6320 | 4444 | 85 | 2 | 0 | 3 | 689 | 11543 |
| Pending Paid Off | 17 | 15 | 0 | 0 | 0 | 0 | 137 | 169 |
| Pending Rescind | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 4 |
| Returned Item | 5 | 3 | 2 | 0 | 0 | 0 | 1173 | 1183 |
| Settled Bankruptcy | 309 | 5 | 0 | 2 | 0 | 0 | 36 | 352 |
| Settlement Paid Off | 278 | 414 | 0 | 5 | 1 | 0 | 18 | 716 |
| Settlement Pending Paid Off | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 1 |
| Voided New Loan | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 2 |
| Withdrawn Application | 4 | 0 | 0 | 0 | 0 | 0 | 1 | 5 |
| All | 22123 | 5198 | 88 | 232 | 4 | 3 | 0 | 42374 |
Discrepancies in lpymtstatus occur when the paymentStatus from the most recent paymentDate in the payment data differs from the loanStatus in the loan data. For example, there are 85 rows where loanStatus is Paid Off Loan but lpymtstatus is Pending. Based on the data dictionary, the payment data doesn't appear to reflect real-time status, even though loanStatus is defined as the current loan status. Nonetheless, I'll proceed with using loanStatus from the loan data, despite the lack of documentation on how it was derived. I'll also aggregate payment data at the loan level in a status-specific manner to evaluate model performance using all currently available features at this stage.
Aggregate numerical features¶
- paymentStatus-specific summary statistics at the loan level for:
  - principal
  - fees
  - paymentAmount
- days between payment entries
Looking only at loan level totals does not always tell the full story of repayment behavior. Two loans might show the same total amount paid, but one may consist mostly of completed payments while another includes several that were rejected, skipped, or returned. Even though the totals are identical, the risk behind those loans is very different. Breaking payments down by their status such as cancelled, pending, complete, rejected, skipped, awaiting retry, or returned gives a clearer picture of how each loan is being repaid.
This approach also helps capture patterns that a simple average can easily hide. Imagine a loan where payments are often missed but occasionally covered by a large lump sum. The average payment amount might look acceptable, but the irregular pattern suggests higher risk. By calculating the sum, median, standard deviation, minimum and maximum within each status, a much more complete view becomes possible. The sum shows the overall amount paid, the median reflects a typical payment without being skewed by extremes, the standard deviation highlights whether payments are steady or irregular, and the minimum and maximum show the smallest and largest payments, which can reveal unusual behavior.
Put simply, totals and averages give only part of the story, while status based measures provide the full picture. This makes it easier to see repayment patterns clearly and to identify loans that might carry more risk even when the totals look the same.
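A toy illustration (made-up numbers, hypothetical loanId values) of the point above: both loans total the same amount paid, but the per-status breakdown separates them immediately.

```python
import pandas as pd

# Two loans with identical paymentAmount totals but different status mixes
toy = pd.DataFrame({
    "loanId":        ["A", "A", "A", "B", "B", "B"],
    "paymentStatus": ["Checked", "Checked", "Checked",
                      "Checked", "Rejected", "Rejected"],
    "paymentAmount": [100, 100, 100, 100, 100, 100],
})

# Loan-level totals: A and B look identical
totals = toy.groupby("loanId")["paymentAmount"].sum()

# Status-specific sums: B's repayments are mostly rejected
by_status = (toy.groupby(["loanId", "paymentStatus"])["paymentAmount"]
                .sum().unstack(fill_value = 0))
```

Here `totals` is 300 for both loans, yet `by_status` shows loan A fully Checked while two-thirds of loan B's attempts were Rejected, which is exactly the signal the status-specific aggregation preserves.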
# Quick overview
payment_df.describe(include = "all").T
| count | unique | top | freq | mean | min | 25% | 50% | 75% | max | std | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| loanId | 689364 | 39952 | LL-I-12230332 | 105 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| installmentIndex | 689364.0 | NaN | NaN | NaN | 10.553222 | 1.0 | 5.0 | 9.0 | 14.0 | 105.0 | 8.04953 |
| isCollection | 689364 | 2 | False | 675469 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| paymentDate | 689364 | NaN | NaN | NaN | 2016-10-17 17:12:40.541038080 | 2014-12-09 05:00:00 | 2016-04-29 04:00:00 | 2016-12-27 05:00:00 | 2017-04-14 04:00:00 | 2021-02-26 05:00:00 | NaN |
| principal | 689364.0 | NaN | NaN | NaN | 45.557543 | -303.37 | 13.18 | 27.61 | 53.38 | 4000.0 | 81.724683 |
| fees | 689364.0 | NaN | NaN | NaN | 67.003994 | -42.56 | 28.82 | 51.3 | 86.44 | 1257.71 | 59.78951 |
| paymentAmount | 689364.0 | NaN | NaN | NaN | 112.680232 | -337.7 | 56.81 | 86.34 | 135.09 | 4063.6 | 105.78371 |
| paymentStatus | 525307 | 8 | Cancelled | 270334 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| paymentReturnCode | 31533 | 31 | R01 | 22866 | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
There are rows with principal < 0, fees < 0 and paymentAmount < 0.
How many such rows exist?
Let's check below 👇
# Find loanId with principal < 0, or fees < 0, or paymentAmount < 0
loanId_w_neg_val = payment_df[(payment_df["principal"] < 0) | (payment_df["fees"] < 0) | (payment_df["paymentAmount"] < 0)]["loanId"].unique()
display(Markdown(f'**- {loanId_w_neg_val.shape[0]} unique loanId with either principal < 0, fees < 0 or paymentAmount < 0:**'
f'<br>{", ".join(loanId_w_neg_val)}'))
display(Markdown(f'**- {payment_df[(payment_df["principal"] < 0) | (payment_df["fees"] < 0) | (payment_df["paymentAmount"] < 0)].shape[0]} payment entries with either principal < 0, or fees < 0 or paymentAmount < 0.**'))
# Filter/identify payment entries linked to specific loanId where at least one of principal, fees, or paymentAmount is negative i.e. < 0:
filtered_df = payment_df[payment_df["loanId"].isin(loanId_w_neg_val)]
display(Markdown(f'**- {filtered_df.shape[0]} payment entries associated with a loanId that have either a principal < 0, fee < 0, or paymentAmount < 0 👇.**'))
filtered_df
del loanId_w_neg_val, filtered_df
- 15 unique loanId with either principal < 0, fees < 0 or paymentAmount < 0:
LL-I-07515698, LL-I-07882270, LL-I-07918008, LL-I-07930582, LL-I-07930820, LL-I-07931827, LL-I-07942777, LL-I-07945456, LL-I-08802275, LL-I-08901334, LL-I-09026647, LL-I-12122640, LL-I-12122658, LL-I-13301264, LL-I-13303260
- 32 payment entries with either principal < 0, or fees < 0 or paymentAmount < 0.
- 325 payment entries associated with a loanId that have either a principal < 0, fee < 0, or paymentAmount < 0 👇.
| loanId | installmentIndex | isCollection | paymentDate | principal | fees | paymentAmount | paymentStatus | paymentReturnCode | |
|---|---|---|---|---|---|---|---|---|---|
| 201714 | LL-I-07515698 | 1 | False | 2016-04-01 04:00:00 | 0.00 | 71.12 | 71.12 | Checked | NaN |
| 201715 | LL-I-07515698 | 2 | False | 2016-04-08 04:00:00 | 6.59 | 45.26 | 51.85 | Checked | NaN |
| 201716 | LL-I-07515698 | 3 | False | 2016-04-15 04:00:00 | 7.34 | 44.51 | 51.85 | Checked | NaN |
| 201717 | LL-I-07515698 | 4 | False | 2016-04-22 04:00:00 | 8.17 | 43.68 | 51.85 | Checked | NaN |
| 201718 | LL-I-07515698 | 5 | False | 2016-04-29 04:00:00 | 9.09 | 42.76 | 51.85 | Checked | NaN |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 475358 | LL-I-13303260 | 23 | False | 2017-05-26 04:00:00 | 39.22 | -0.61 | 38.61 | Cancelled | NaN |
| 475359 | LL-I-13303260 | 24 | False | 2017-06-02 04:00:00 | 43.67 | -5.06 | 38.61 | Cancelled | NaN |
| 475360 | LL-I-13303260 | 25 | False | 2017-06-09 04:00:00 | 48.63 | -10.02 | 38.61 | Cancelled | NaN |
| 475361 | LL-I-13303260 | 26 | False | 2017-06-16 04:00:00 | 54.15 | -15.54 | 38.61 | Cancelled | NaN |
| 475362 | LL-I-13303260 | 27 | False | 2017-06-23 04:00:00 | -191.09 | -21.68 | -212.77 | Cancelled | NaN |
325 rows × 9 columns
With my limited understanding of this area, I'd assume the 32 payment entries showing negative values in the principal, fees, or paymentAmount column still make sense in a financial context (e.g. reversals or adjustments), especially since they represent only a small portion of the overall data. That said, this should be confirmed with a subject-matter expert (SME).
Altogether, there are 325 payment entries tied to the 15 loans where either the principal, fees, or paymentAmount is negative.
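Rather than dropping these entries, one option is a per-loan indicator flagging any negative value for SME review or as a model feature. A minimal sketch on toy data (the column name `has_neg_pymt` is my own, not from the source data):

```python
import pandas as pd

# Toy payment rows: loan A contains one negative entry, loan B has none
toy = pd.DataFrame({
    "loanId":        ["A", "A", "B"],
    "principal":     [50.0, -191.09, 40.0],
    "fees":          [10.0, -21.68, 5.0],
    "paymentAmount": [60.0, -212.77, 45.0],
})

# Row-level mask: any of the three amount columns negative
neg_mask = (toy[["principal", "fees", "paymentAmount"]] < 0).any(axis = 1)

# Loan-level flag: True if any payment row for the loan was negative
has_neg = neg_mask.groupby(toy["loanId"]).any().rename("has_neg_pymt")
```

This keeps all 325 entries in the data while still surfacing the 15 affected loans.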
# Sum principal, fees and paymentAmount for each loan when status is either Checked or Complete to check against loan_df.originallyScheduledPaymentAmount
sum_df = payment_df[payment_df["paymentStatus"].isin(["Checked", "Complete"])] \
.groupby("loanId")[["principal", "fees", "paymentAmount"]].sum() \
.rename(columns = lambda col: f'{col}_tot').reset_index()
# Handle missing values i.e. NaN in paymentStatus -> No ACH attempt has been made yet – usually because the payment is scheduled for the future according to MoneyLion Data Scientist Assessment Data Dictionary.docx
# Confirmed by email
payment_df["paymentStatus_recode"] = payment_df["paymentStatus"].fillna("None")
# Melt the DataFrame from wide to long format to unpivot columns into rows for easier aggregation on paymentStatus by loanId
melted_df = payment_df.melt(id_vars = ["loanId", "paymentStatus_recode"],
value_vars = ["principal", "fees", "paymentAmount"],
var_name = "type",
value_name = "amount").replace({"type": {"paymentAmount": "pymtAmt"}})
# Create a pivot table to aggregate data i.e. 4-way table
num_agg = melted_df.pivot_table(index = "loanId",
columns = ["type", "paymentStatus_recode"],
values = "amount",
aggfunc = ["sum", "mean", "median", "std", "count", "min", "max"],
fill_value = 0)
# Flatten the multi-level cols for numerical aggregation
num_agg.columns = ["_".join(col).replace("median", "med").replace("count", "cnt").strip() for col in num_agg.columns.values]
num_agg.reset_index(inplace = True)
del melted_df
# Sort by loanId and paymentDate
payment_df.sort_values(by = ["loanId", "paymentDate"], inplace = True)
# Calculate the difference in days between consecutive payments for each loanId
payment_df["days_btw_pymts"] = payment_df.groupby("loanId")["paymentDate"].diff().dt.days
# Fill NaN values in days_btw_pymts with 0 (for the first payment)
payment_df["days_btw_pymts"] = payment_df["days_btw_pymts"].fillna(0)
# Aggregate paymentDate with custom column names by loanId
days_btw_pymts = payment_df.groupby("loanId")["days_btw_pymts"].agg(sum_days_btw_pymts = "sum",
mean_days_btw_pymts = "mean",
med_days_btw_pymts = "median",
std_days_btw_pymts = "std",
cnt_days_btw_pymts = "count",
min_days_btw_pymts = "min",
max_days_btw_pymts = "max").reset_index()
Aggregate categorical features¶
- isCollection
- paymentStatus
- paymentReturnCode
# Recode according to MoneyLion Data Scientist Assessment Data Dictionary.docx i.e. True is custom collection
payment_df["isCollection_recode"] = payment_df["isCollection"].map({True: "custom", False: "non custom"})
# List categorical features
cat_feat = ["isCollection_recode", "paymentStatus_recode", "paymentReturnCode"]
# Aggregate categorical features by counting occurrences of each category
cat_cnts_df = []
for feat in cat_feat:
# Aggregate categorical features by counting occurrences of each category
cat_cnts = payment_df.groupby("loanId")[feat].value_counts().unstack(fill_value = 0)
# Prepend prefixes based on the col name
if feat == "isCollection_recode":
cat_cnts.columns=[f'cnt_{col}' for col in cat_cnts.columns]
elif feat == "paymentStatus_recode":
cat_cnts.columns=[f'cnt_pymtStatus_{col}' for col in cat_cnts.columns]
elif feat == "paymentReturnCode":
cat_cnts.columns = [f'cnt_pymtRCode_{col}' for col in cat_cnts.columns]
# Handle cases where no paymentReturnCode exists for a loanId
cat_cnts = cat_cnts.reindex(payment_df["loanId"].unique(), fill_value = 0)
#cat_cnts.info(verbose = True)
# Append the modified DataFrame to the list
cat_cnts_df.append(cat_cnts)
# Concatenate categorical counts for all categorical features
cat_agg = pd.concat(cat_cnts_df, axis = 1).reset_index()
del cat_feat, cat_cnts, cat_cnts_df
Merge loan-level aggregated payment Pandas DataFrames¶
# Loan-specific totals for each of principal, fees and payment_amount
dfs = [sum_df, days_btw_pymts, num_agg, cat_agg]
# Merge all dataframes on loanId using inner join
merged_df = reduce(lambda left, right: pd.merge(left, right, on = "loanId", how = "inner"), dfs)
del dfs, sum_df, days_btw_pymts, num_agg, cat_agg
First payment¶
Conditioning on payment amount > 0:
- first payment date
- first payment amount
- first payment status
# Check loanId with > 1 row of identical paymentDate and installmentIndex == 1
# Earliest paymentDate for each ID
earliest_dates = payment_df.groupby("loanId")["paymentDate"].transform("min")
# Filter rows where paymentDate is the earliest and installmentIndex == 1
filtered_df = payment_df[(payment_df["paymentDate"] == earliest_dates) & (payment_df["installmentIndex"] == 1)]
# Group by loanId and paymentDate, then count the number of rows for each loanId and paymentDate
ids_w_dup = filtered_df.groupby(["loanId", "paymentDate"]).size()
# Get loanId and earliest paymentDate with > 1 row
ids_w_mult_rows = ids_w_dup[ids_w_dup > 1].index.tolist()
display(Markdown(f'**loanId with >1 row for the earliest paymentDate and installmentIndex == 1: {ids_w_mult_rows}.**'))
loanId with >1 row for the earliest paymentDate and installmentIndex == 1: [('LL-I-04451435', Timestamp('2015-11-27 05:00:00'))].
"""
# Check:
payment_df[(payment_df["loanId"] == "LL-I-00344987") & (payment_df["installmentIndex"] == 1)]
filtered_df[filtered_df["loanId"] == "LL-I-00344987"]
payment_df[(payment_df["loanId"] == "LL-I-04451435") & (payment_df["installmentIndex"] == 1)]
filtered_df[filtered_df["loanId"] == "LL-I-04451435"] # 1 x custom and 1 x non-custom collection, loan_df.fpstatus takes "non-custom" entry i.e. paymentStatus = Cancelled
del earliest_dates, filtered_df, ids_w_dup, ids_w_mult_rows
""";
"""
# Check:
# Find the earliest payment date for each loanId
earliest_dates = payment_df.groupby("loanId")["paymentDate"].min().reset_index()
# Merge with the original DataFrame to get all rows with the earliest date
temp_df = payment_df.merge(earliest_dates, on = ["loanId", "paymentDate"], how = "inner")
# Count occurrences of each loanId
id_cnts = temp_df['loanId'].value_counts().reset_index()
id_cnts.columns = ["loanId", "count"]
# Find the maximum occurrence count
max_cnt = id_cnts["count"].max()
print("\nCount of occurrences for each loanId:")
print(id_cnts)
print("\nCount of >1 occurrences for each loanId:")
print(id_cnts[id_cnts["count"] > 1])
print(f'\nMaximum occurrences in loanId: {max_cnt}')
del earliest_dates, temp_df, id_cnts, max_cnt
# 189 unique loandId have > 1 record with the same earliest paymentDate, with up to 3 entries sharing that paymentDate
""";
# For each loanId, retrieve the row with the earliest paymentDate where paymentAmount > 0, keep ["loanId", "paymentDate", "paymentAmount", "paymentStatus_recode"],
# and rename them accordingly
earliest_df = (payment_df[payment_df["paymentAmount"] > 0]
.loc[payment_df[payment_df["paymentAmount"] > 0]
.groupby("loanId")["paymentDate"]
.idxmin(),
["loanId", "paymentDate", "paymentAmount", "paymentStatus_recode"]]
.rename(columns = {"paymentDate": "fpymtDate",
"paymentAmount": "fpymtAmt",
"paymentStatus_recode": "fpymtStatus"}));
"""
# Check: if there are multiple rows with the earliest payment date for each loanId
# Count total number of rows for each loanId in earliest_df and sort by descending order
row_cnts_sorted = earliest_df.groupby("loanId").size().reset_index(name = "total_rows").sort_values(by = "total_rows", ascending = False)
row_cnts_sorted[["total_rows"]].describe(include = "all").T # only one row
""";
"""
count mean std min 25% 50% 75% max
total_rows 39952.0 1.0 0.0 1.0 1.0 1.0 1.0 1.0
""";
"""
# Check:
earliest_df[earliest_df["loanId"] == "LL-I-12556329"]
earliest_df[earliest_df["loanId"] == "LL-I-04451435"]
loan_df[loan_df["loanId"] == "LL-I-04451435"]
""";
# Merge with previous wide payment df -> merged_df
agg_pymt_df = pd.merge(merged_df, earliest_df, on = "loanId", how = "outer", indicator = True)
# Check:
#agg_pymt_df._merge.value_counts(dropna = False)
"""
_merge
both 39952
left_only 0
right_only 0
Name: count, dtype: int64
"""
# Drop merging indicator
agg_pymt_df.drop(columns = "_merge", inplace = True)
#print("Aggregated payment df:")
#agg_pymt_df.head()
display(Markdown(f'**Total unique loanId from provided payment_df: {payment_df["loanId"].nunique(dropna = False)}**'))
display(Markdown(f'**Against the following:**'))
agg_pymt_df.info(verbose = True)
del merged_df, earliest_df;
Total unique loanId from provided payment_df: 39952
Against the following:
<class 'pandas.core.frame.DataFrame'> RangeIndex: 39952 entries, 0 to 39951. Data columns (total 239 columns): loanId (object) plus loan-level float64 aggregates — principal_tot, fees_tot, paymentAmount_tot; {sum, mean, med, std, cnt, min, max}_days_btw_pymts; and {sum, mean, med, std, cnt, min, max}_{principal, fees, pymtAmt}_{status} for each payment status (Cancelled, Checked, Complete, None, Pending, Rejected, Rejected Awaiting Retry, Returned, Skipped). Full column listing truncated in the notebook output.
119 cnt_fees_Rejected Awaiting Retry float64 120 cnt_fees_Returned float64 121 cnt_fees_Skipped float64 122 cnt_principal_Cancelled float64 123 cnt_principal_Checked float64 124 cnt_principal_Complete float64 125 cnt_principal_None float64 126 cnt_principal_Pending float64 127 cnt_principal_Rejected float64 128 cnt_principal_Rejected Awaiting Retry float64 129 cnt_principal_Returned float64 130 cnt_principal_Skipped float64 131 cnt_pymtAmt_Cancelled float64 132 cnt_pymtAmt_Checked float64 133 cnt_pymtAmt_Complete float64 134 cnt_pymtAmt_None float64 135 cnt_pymtAmt_Pending float64 136 cnt_pymtAmt_Rejected float64 137 cnt_pymtAmt_Rejected Awaiting Retry float64 138 cnt_pymtAmt_Returned float64 139 cnt_pymtAmt_Skipped float64 140 min_fees_Cancelled float64 141 min_fees_Checked float64 142 min_fees_Complete float64 143 min_fees_None float64 144 min_fees_Pending float64 145 min_fees_Rejected float64 146 min_fees_Rejected Awaiting Retry float64 147 min_fees_Returned float64 148 min_fees_Skipped float64 149 min_principal_Cancelled float64 150 min_principal_Checked float64 151 min_principal_Complete float64 152 min_principal_None float64 153 min_principal_Pending float64 154 min_principal_Rejected float64 155 min_principal_Rejected Awaiting Retry float64 156 min_principal_Returned float64 157 min_principal_Skipped float64 158 min_pymtAmt_Cancelled float64 159 min_pymtAmt_Checked float64 160 min_pymtAmt_Complete float64 161 min_pymtAmt_None float64 162 min_pymtAmt_Pending float64 163 min_pymtAmt_Rejected float64 164 min_pymtAmt_Rejected Awaiting Retry float64 165 min_pymtAmt_Returned float64 166 min_pymtAmt_Skipped float64 167 max_fees_Cancelled float64 168 max_fees_Checked float64 169 max_fees_Complete float64 170 max_fees_None float64 171 max_fees_Pending float64 172 max_fees_Rejected float64 173 max_fees_Rejected Awaiting Retry float64 174 max_fees_Returned float64 175 max_fees_Skipped float64 176 max_principal_Cancelled float64 177 max_principal_Checked float64 178 
max_principal_Complete float64 179 max_principal_None float64 180 max_principal_Pending float64 181 max_principal_Rejected float64 182 max_principal_Rejected Awaiting Retry float64 183 max_principal_Returned float64 184 max_principal_Skipped float64 185 max_pymtAmt_Cancelled float64 186 max_pymtAmt_Checked float64 187 max_pymtAmt_Complete float64 188 max_pymtAmt_None float64 189 max_pymtAmt_Pending float64 190 max_pymtAmt_Rejected float64 191 max_pymtAmt_Rejected Awaiting Retry float64 192 max_pymtAmt_Returned float64 193 max_pymtAmt_Skipped float64 194 cnt_custom float64 195 cnt_non custom float64 196 cnt_pymtStatus_Cancelled float64 197 cnt_pymtStatus_Checked float64 198 cnt_pymtStatus_Complete float64 199 cnt_pymtStatus_None float64 200 cnt_pymtStatus_Pending float64 201 cnt_pymtStatus_Rejected float64 202 cnt_pymtStatus_Rejected Awaiting Retry float64 203 cnt_pymtStatus_Returned float64 204 cnt_pymtStatus_Skipped float64 205 cnt_pymtRCode_C01 float64 206 cnt_pymtRCode_C02 float64 207 cnt_pymtRCode_C03 float64 208 cnt_pymtRCode_C05 float64 209 cnt_pymtRCode_C07 float64 210 cnt_pymtRCode_LPP01 float64 211 cnt_pymtRCode_MISSED float64 212 cnt_pymtRCode_R01 float64 213 cnt_pymtRCode_R02 float64 214 cnt_pymtRCode_R03 float64 215 cnt_pymtRCode_R04 float64 216 cnt_pymtRCode_R06 float64 217 cnt_pymtRCode_R07 float64 218 cnt_pymtRCode_R08 float64 219 cnt_pymtRCode_R09 float64 220 cnt_pymtRCode_R10 float64 221 cnt_pymtRCode_R13 float64 222 cnt_pymtRCode_R15 float64 223 cnt_pymtRCode_R16 float64 224 cnt_pymtRCode_R19 float64 225 cnt_pymtRCode_R20 float64 226 cnt_pymtRCode_R29 float64 227 cnt_pymtRCode_R99 float64 228 cnt_pymtRCode_RAF float64 229 cnt_pymtRCode_RBW float64 230 cnt_pymtRCode_RFG float64 231 cnt_pymtRCode_RIR float64 232 cnt_pymtRCode_RUP float64 233 cnt_pymtRCode_RWC float64 234 cnt_pymtRCode_RXL float64 235 cnt_pymtRCode_RXS float64 236 fpymtDate datetime64[ns] 237 fpymtAmt float64 238 fpymtStatus object dtypes: datetime64[ns](1), float64(236), object(2) 
memory usage: 72.8+ MB
# Check:
#agg_pymt_df[agg_pymt_df["loanId"] == "LP-I-00000145"]
#payment_df[payment_df["loanId"] == "LL-I-00000231"]
#agg_pymt_df[agg_pymt_df["loanId"] == "LL-I-00000231"]
cuv + loan + payment (loan level)¶
- underwriting data (with or without a matching underwritingid)
- loan data (with or without a matching clarityFraudId), and
- aggregated payment data (with or without a matching loanId)
combined_df = pd.merge(cuv_loan_df, agg_pymt_df, on = "loanId", how = "outer", indicator = True)
del cuv_loan_df, agg_pymt_df;
combined_df.shape
combined_df._merge.value_counts(dropna = False)
#combined_df[["underwritingid", "clarityFraudId", "loanId", "cuv_loan_ind", "_merge"]].head(15)
(595686, 313)
_merge left_only 555734 both 39952 right_only 0 Name: count, dtype: int64
There are a total of 595686 rows and 313 columns after merging the underwriting, loan and payment data at the loan level.
- 555734 rows come from the merged underwriting/loan data only, with no matching payment records
- 39952 rows are matched across all three datasets
- Zero rows come from the payment data alone
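The outer-merge bookkeeping above can be sanity-checked on a toy example; the loanId values below are hypothetical, not taken from the actual data:

```python
import pandas as pd

# Hypothetical stand-ins for cuv_loan_df and agg_pymt_df
left = pd.DataFrame({"loanId": ["A", "B", "C"], "x": [1, 2, 3]})
right = pd.DataFrame({"loanId": ["B", "C", "D"], "y": [10, 20, 30]})

# how="outer" keeps unmatched rows from both sides;
# indicator=True adds a _merge column flagging the source of each row
merged = pd.merge(left, right, on="loanId", how="outer", indicator=True)

# A -> left_only, B/C -> both, D -> right_only
print(merged["_merge"].value_counts())
```

In the actual merge, right_only is zero because every aggregated payment row carries a loanId already present in the loan data.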
combined_df.groupby(["cuv_loan_ind", "_merge"], observed = False).size().unstack(fill_value = 0)
# Define conditions
cond=[(combined_df["cuv_loan_ind"] == "in_cuv") & (combined_df["_merge"] == "left_only"),
(combined_df["cuv_loan_ind"] == "in_cuv_loan") & (combined_df["_merge"] == "left_only"),
(combined_df["cuv_loan_ind"] == "in_cuv_loan") & (combined_df["_merge"] == "both"),
(combined_df["cuv_loan_ind"] == "in_loan") & (combined_df["_merge"] == "left_only"),
(combined_df["cuv_loan_ind"] == "in_loan") & (combined_df["_merge"] == "both")]
# Define corresponding values
ind = ["in_cuv", "in_cuv_loan", "in_cuv_loan_pay", "in_loan", "in_loan_pay"]
# Create the cuv_loan_pay_ind indicator based on the conditions above, flagging whether each row exists in cuv, loan and/or payment data according to the respective IDs
combined_df["cuv_loan_pay_ind"] = np.select(cond, ind, default = None)
# Check:
#combined_df[["underwritingid", "clarityFraudId", "loanId", "cuv_loan_ind", "_merge", "cuv_loan_pay_ind"]].sample(15)
combined_df.groupby(["cuv_loan_ind", "cuv_loan_pay_ind"], observed = False).size().unstack(fill_value = 0)
del cond, ind
| _merge | left_only | right_only | both |
|---|---|---|---|
| cuv_loan_ind | |||
| in_cuv | 18004 | 0 | 0 |
| in_cuv_loan | 4022 | 0 | 32312 |
| in_loan | 533708 | 0 | 7640 |
| cuv_loan_pay_ind | in_cuv | in_cuv_loan | in_cuv_loan_pay | in_loan | in_loan_pay |
|---|---|---|---|---|---|
| cuv_loan_ind | |||||
| in_cuv | 18004 | 0 | 0 | 0 | 0 |
| in_cuv_loan | 0 | 4022 | 32312 | 0 | 0 |
| in_loan | 0 | 0 | 0 | 533708 | 7640 |
# Check missing data in the columns of interest
# ID and indicator columns are dropped, as they are not of interest
cols_to_keep = combined_df.drop(columns = ["underwritingid", "clarityFraudId", "loanId", "cuv_loan_ind", "_merge", "cuv_loan_pay_ind"]).columns
display(Markdown(f'**{combined_df[cols_to_keep].isnull().any(axis = 1).sum()} rows with at least one missing value, ignoring indicator columns and underwritingid/clarityFraudId/loanId.** '
f'**<br>This includes all rows from both matched and unmatched underwritingid/clarityFraudId/loanId.**'
)
)
del cols_to_keep;
595686 rows with at least one missing value, ignoring indicator columns and underwritingid/clarityFraudId/loanId.
This includes all rows from both matched and unmatched underwritingid/clarityFraudId/loanId.
Matching data¶
- To leverage all three datasets (cuv_df, loan_df and payment_df), keep only rows with matching cuv_df.underwritingid, loan_df.clarityFraudId and loanId, so that information is aligned accurately and data-integrity issues are avoided.
match_df = combined_df[combined_df["cuv_loan_pay_ind"] == "in_cuv_loan_pay"].drop(columns = ["cuv_loan_ind", "_merge", "cuv_loan_pay_ind"])
# Reset index
match_df.reset_index(drop = True, inplace = True)
Boolean-like features¶
Convert object-Dtype features that take only the following values 👇 to a nullable boolean Dtype for better memory efficiency
- True, False and NaN
- 0, 1 and NaN
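The helper `is_bool_nan_col` used in the next cell is defined earlier in the notebook; a minimal sketch of the idea (the notebook's actual implementation may differ):

```python
import numpy as np
import pandas as pd

def is_bool_nan_col(s: pd.Series) -> bool:
    # True when the non-null values are drawn only from {True, False};
    # an all-NaN column returns False, since there is nothing to convert.
    # Note: Python treats 0/1 as equal to False/True, so 0/1-valued object
    # columns would also pass this check.
    vals = set(s.dropna().unique())
    return len(vals) > 0 and vals <= {True, False}

s1 = pd.Series([True, False, np.nan], dtype="object")
s2 = pd.Series(["yes", "no"], dtype="object")
print(is_bool_nan_col(s1), is_bool_nan_col(s2))  # True False
```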
# Identify columns with object Dtype that contain only True, False and NaN values
bool_obj_cols = [col for col in match_df.select_dtypes(include = ["object"]).columns if is_bool_nan_col(match_df[col])]
"""
match_df[bool_obj_cols].info()
# Inspect values of bool_obj_cols
for col in match_df[bool_obj_cols].columns:
print(match_df[col].value_counts(dropna = False),"\n")
del col
""";
# Loop through the identified bool_obj_cols and assign nullable boolean Dtype i.e. True/False/<NA>
for col in bool_obj_cols:
match_df[col] = match_df[col].astype("boolean")
"""
# Check
match_df[bool_obj_cols].info()
# Inspect Dtype of post processed bool_obj_cols
for col in match_df[bool_obj_cols].columns:
print(match_df[col].value_counts(dropna = False),"\n")
""";
# Identify other features with only 0, 1 and NaN
bin_feat = [col for col in match_df.columns if match_df[col].dtype in [np.int64, np.float64] and \
set(match_df[col].dropna().unique()) <= {0, 1}]
display(Markdown(f'**Features with only 0, 1 and NaN:**<br>{bin_feat}'))
del bin_feat
Features with only 0, 1 and NaN:
['isFunded', 'hasCF', 'min_days_btw_pymts', 'sum_fees_Complete', 'sum_fees_Returned', 'sum_principal_Complete', 'sum_principal_Returned', 'sum_pymtAmt_Complete', 'sum_pymtAmt_Returned', 'mean_fees_Complete', 'mean_fees_Returned', 'mean_principal_Complete', 'mean_principal_Returned', 'mean_pymtAmt_Complete', 'mean_pymtAmt_Returned', 'med_fees_Complete', 'med_fees_Returned', 'med_principal_Complete', 'med_principal_Returned', 'med_pymtAmt_Complete', 'med_pymtAmt_Returned', 'cnt_fees_Complete', 'cnt_fees_Returned', 'cnt_principal_Complete', 'cnt_principal_Returned', 'cnt_pymtAmt_Complete', 'cnt_pymtAmt_Returned', 'min_fees_Complete', 'min_fees_Returned', 'min_principal_Complete', 'min_principal_Returned', 'min_pymtAmt_Complete', 'min_pymtAmt_Returned', 'max_fees_Complete', 'max_fees_Returned', 'max_principal_Complete', 'max_principal_Returned', 'max_pymtAmt_Complete', 'max_pymtAmt_Returned', 'cnt_pymtStatus_Complete', 'cnt_pymtStatus_Returned', 'cnt_pymtRCode_C01', 'cnt_pymtRCode_C02', 'cnt_pymtRCode_C03', 'cnt_pymtRCode_LPP01', 'cnt_pymtRCode_R04', 'cnt_pymtRCode_R13', 'cnt_pymtRCode_R15', 'cnt_pymtRCode_R19', 'cnt_pymtRCode_R20', 'cnt_pymtRCode_R29', 'cnt_pymtRCode_RBW', 'cnt_pymtRCode_RFG', 'cnt_pymtRCode_RIR', 'cnt_pymtRCode_RUP', 'cnt_pymtRCode_RWC', 'cnt_pymtRCode_RXL']
# Convert the identified features to nullable boolean Dtype
cols_to_keep = ["isFunded", "hasCF"]
match_df[cols_to_keep] = match_df[cols_to_keep].astype("boolean")
del cols_to_keep
Integer-like features¶
Convert float-Dtype features that contain only whole numbers to an integer Dtype for logical consistency and memory efficiency
# Identify float-type columns that contain only whole numbers (i.e., integers stored as floats)
# by checking if the remainder of division by 1 (mod 1) is 0 for all non-null values
float_int_feat = [col for col in match_df.select_dtypes(include = ["float"]).columns
if match_df[col].dropna().mod(1).eq(0).all()]
display(Markdown(f'**Features with float Dtype but take integer values:**<br>{float_int_feat}'))
Features with float Dtype but take integer values:
['cfinq.thirtydaysago', 'cfinq.twentyfourhoursago', 'cfinq.oneminuteago', 'cfinq.onehourago', 'cfinq.ninetydaysago', 'cfinq.sevendaysago', 'cfinq.tenminutesago', 'cfinq.fifteendaysago', 'cfinq.threesixtyfivedaysago', 'cfind.totalnumberoffraudindicators', 'cfind.maxnumberofssnswithanybankaccount', 'cfindvrfy.overallmatchreasoncode', 'clearfraudscore', 'nPaidOff', 'leadCost', 'sum_days_btw_pymts', 'cnt_days_btw_pymts', 'min_days_btw_pymts', 'max_days_btw_pymts', 'sum_fees_Complete', 'sum_fees_Returned', 'sum_principal_Complete', 'sum_principal_Returned', 'sum_pymtAmt_Complete', 'sum_pymtAmt_Returned', 'mean_fees_Complete', 'mean_fees_Returned', 'mean_principal_Complete', 'mean_principal_Returned', 'mean_pymtAmt_Complete', 'mean_pymtAmt_Returned', 'med_fees_Complete', 'med_fees_Returned', 'med_principal_Complete', 'med_principal_Returned', 'med_pymtAmt_Complete', 'med_pymtAmt_Returned', 'cnt_fees_Cancelled', 'cnt_fees_Checked', 'cnt_fees_Complete', 'cnt_fees_None', 'cnt_fees_Pending', 'cnt_fees_Rejected', 'cnt_fees_Rejected Awaiting Retry', 'cnt_fees_Returned', 'cnt_fees_Skipped', 'cnt_principal_Cancelled', 'cnt_principal_Checked', 'cnt_principal_Complete', 'cnt_principal_None', 'cnt_principal_Pending', 'cnt_principal_Rejected', 'cnt_principal_Rejected Awaiting Retry', 'cnt_principal_Returned', 'cnt_principal_Skipped', 'cnt_pymtAmt_Cancelled', 'cnt_pymtAmt_Checked', 'cnt_pymtAmt_Complete', 'cnt_pymtAmt_None', 'cnt_pymtAmt_Pending', 'cnt_pymtAmt_Rejected', 'cnt_pymtAmt_Rejected Awaiting Retry', 'cnt_pymtAmt_Returned', 'cnt_pymtAmt_Skipped', 'min_fees_Complete', 'min_fees_Returned', 'min_principal_Complete', 'min_principal_Returned', 'min_pymtAmt_Complete', 'min_pymtAmt_Returned', 'max_fees_Complete', 'max_fees_Returned', 'max_principal_Complete', 'max_principal_Returned', 'max_pymtAmt_Complete', 'max_pymtAmt_Returned', 'cnt_custom', 'cnt_non custom', 'cnt_pymtStatus_Cancelled', 'cnt_pymtStatus_Checked', 'cnt_pymtStatus_Complete', 'cnt_pymtStatus_None', 
'cnt_pymtStatus_Pending', 'cnt_pymtStatus_Rejected', 'cnt_pymtStatus_Rejected Awaiting Retry', 'cnt_pymtStatus_Returned', 'cnt_pymtStatus_Skipped', 'cnt_pymtRCode_C01', 'cnt_pymtRCode_C02', 'cnt_pymtRCode_C03', 'cnt_pymtRCode_C05', 'cnt_pymtRCode_C07', 'cnt_pymtRCode_LPP01', 'cnt_pymtRCode_MISSED', 'cnt_pymtRCode_R01', 'cnt_pymtRCode_R02', 'cnt_pymtRCode_R03', 'cnt_pymtRCode_R04', 'cnt_pymtRCode_R06', 'cnt_pymtRCode_R07', 'cnt_pymtRCode_R08', 'cnt_pymtRCode_R09', 'cnt_pymtRCode_R10', 'cnt_pymtRCode_R13', 'cnt_pymtRCode_R15', 'cnt_pymtRCode_R16', 'cnt_pymtRCode_R19', 'cnt_pymtRCode_R20', 'cnt_pymtRCode_R29', 'cnt_pymtRCode_R99', 'cnt_pymtRCode_RAF', 'cnt_pymtRCode_RBW', 'cnt_pymtRCode_RFG', 'cnt_pymtRCode_RIR', 'cnt_pymtRCode_RUP', 'cnt_pymtRCode_RWC', 'cnt_pymtRCode_RXL', 'cnt_pymtRCode_RXS']
# Convert features from float to nullable Int32 to save memory without losing precision
cols_to_keep = ["cfinq.thirtydaysago", "cfinq.twentyfourhoursago", "cfinq.oneminuteago", "cfinq.onehourago", "cfinq.ninetydaysago",
"cfinq.sevendaysago", "cfinq.tenminutesago", "cfinq.fifteendaysago", "cfinq.threesixtyfivedaysago",
"cfind.totalnumberoffraudindicators", "cfind.maxnumberofssnswithanybankaccount",
"nPaidOff"] \
+ [col for col in match_df.columns if col.startswith("cnt_")]
# Convert selected columns to numeric type, coercing errors i.e. invalid values become NaN,
# and then convert them to Pandas nullable integer type i.e. Int32 to handle missing values properly
match_df[cols_to_keep] = match_df[cols_to_keep].apply(pd.to_numeric, errors = "coerce").astype("Int32")
del cols_to_keep
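The float-to-Int32 round trip relies on pandas' nullable integer type: plain NumPy int64 cannot represent NaN, but Int32 stores missing values as <NA>. A toy illustration with hypothetical values:

```python
import numpy as np
import pandas as pd

# Whole-number floats plus a missing value -- the pattern flagged above
s = pd.Series([1.0, 2.0, np.nan])

# to_numeric(..., errors="coerce") turns invalid entries into NaN,
# and astype("Int32") keeps the NaN as the nullable <NA> sentinel
s_int = pd.to_numeric(s, errors="coerce").astype("Int32")
print(s_int.dtype, s_int.isna().sum())  # Int32 1
```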
Categorical features¶
- Convert features from object or numerical data types to categorical Dtype to improve memory efficiency and ensure logical consistency.
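The memory argument for category can be seen on a toy low-cardinality column (hypothetical values): pandas stores one copy of each distinct label plus a small integer code per row, instead of a full Python string per row:

```python
import pandas as pd

# 30,000 rows but only two distinct labels -- typical of status-style columns
s_obj = pd.Series(["Checked", "Rejected", "Checked"] * 10_000, dtype="object")
s_cat = s_obj.astype("category")

# deep=True counts the Python string payloads held by the object column
print(s_obj.memory_usage(deep=True) > s_cat.memory_usage(deep=True))  # True
```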
display(Markdown(f'**Object Dtype columns:**<br>{[col for col in match_df.select_dtypes(include = ["object"]).columns]}\n\n'
f'**Non-object Dtype columns ending with code:**<br>{[col for col in match_df.select_dtypes(exclude = ["object"]).columns if col.endswith("code")]}'))
Object Dtype columns:
['cfindvrfy.ssnnamematch', 'cfindvrfy.nameaddressmatch', 'cfindvrfy.phonematchtype', 'cfindvrfy.ssnnamereasoncodedescription', 'cfindvrfy.phonematchresult', 'cfindvrfy.nameaddressreasoncodedescription', 'cfindvrfy.phonematchtypedescription', 'cfindvrfy.overallmatchresult', 'cfindvrfy.phonetype', 'cfindvrfy.ssndobreasoncode', 'cfindvrfy.ssnnamereasoncode', 'cfindvrfy.nameaddressreasoncode', 'cfindvrfy.ssndobmatch', 'underwritingid', 'loanId', 'anon_ssn', 'payFrequency', 'loanStatus', 'state', 'leadType', 'fpStatus', 'clarityFraudId', 'fpymtStatus']
Non-object Dtype columns ending with code:
['cfindvrfy.overallmatchreasoncode']
# Capture both object Dtype columns and non-object Dtype columns ending with "code" in one list
cols_to_keep = ([col for col in match_df.select_dtypes(include = ["object"]).columns] +
[col for col in match_df.select_dtypes(exclude = ["object"]).columns if col.endswith("code")])
display(Markdown("**Columns to convert to category Dtype:**"))
for col in cols_to_keep:
if col not in {"underwritingid", "loanId", "anon_ssn", "clarityFraudId"}:
print(match_df[col].value_counts(dropna = False), "\n")
# From float to nullable integer
match_df["cfindvrfy.overallmatchreasoncode"] = match_df["cfindvrfy.overallmatchreasoncode"].astype("Int32")
# Assign category to object Dtype columns
for col in match_df[cols_to_keep].drop(columns = ["underwritingid", "loanId", "anon_ssn", "clarityFraudId"]).columns:
match_df[col] = match_df[col].astype("category")
del cols_to_keep, col
Columns to convert to category Dtype:
cfindvrfy.ssnnamematch
match 28876
partial 2138
mismatch 1057
unavailable 207
NaN 26
invalid 8
Name: count, dtype: int64
cfindvrfy.nameaddressmatch
match 12163
mismatch 11660
unavailable 4118
partial 3624
invalid 721
NaN 26
Name: count, dtype: int64
cfindvrfy.phonematchtype
M 29725
U 1072
NaN 612
FA 491
LA 129
A 126
F 93
L 32
P 32
Name: count, dtype: int64
cfindvrfy.ssnnamereasoncodedescription
NaN 30551
(S03) SSN match to address only 1576
(S07) SSN Match to last name only 185
Name: count, dtype: int64
cfindvrfy.phonematchresult
unavailable 30829
match 713
invalid 444
partial 158
mismatch 134
NaN 34
Name: count, dtype: int64
cfindvrfy.nameaddressreasoncodedescription
NaN 28688
(A8) Match to Last Name only 3624
Name: count, dtype: int64
cfindvrfy.phonematchtypedescription
(M) Mobile Phone 29725
(U) Unlisted 1072
NaN 612
(FA) Full Name and Address 491
(LA) Last Name and Address 129
(A) Address Only 126
(F) Full Name Only 93
(L) Last Name Only 32
(P) Pager 32
Name: count, dtype: int64
cfindvrfy.overallmatchresult
partial 22658
match 9392
other 173
mismatch 63
NaN 26
Name: count, dtype: int64
cfindvrfy.phonetype
NaN 31307
R 950
B 43
MU 12
Name: count, dtype: int64
cfindvrfy.ssndobreasoncode
NaN 26469
D07 2894
D04 1358
D03 812
D01 593
D02 164
D06 22
Name: count, dtype: int64
cfindvrfy.ssnnamereasoncode
NaN 30551
S03 1576
S07 185
Name: count, dtype: int64
cfindvrfy.nameaddressreasoncode
NaN 28688
A8 3624
Name: count, dtype: int64
cfindvrfy.ssndobmatch
match 25838
partial 4485
invalid 1573
mismatch 324
unavailable 66
NaN 26
Name: count, dtype: int64
payFrequency
B 18759
W 8888
S 2170
M 2077
I 418
Name: count, dtype: int64
loanStatus
External Collection 9335
Paid Off Loan 9086
New Loan 6529
Internal Collection 5134
Returned Item 1051
Settlement Paid Off 536
Settled Bankruptcy 283
Pending Paid Off 112
Charged Off Paid Off 109
Credit Return Void 70
Customer Voided New Loan 47
CSR Voided New Loan 16
Withdrawn Application 3
Charged Off 1
Name: count, dtype: int64
state
OH 5017
IL 4577
TX 2203
WI 1840
MO 1795
FL 1648
MI 1513
IN 1502
CA 1452
VA 1299
NC 1218
TN 1183
PA 1076
NJ 1000
SC 647
AZ 533
NV 506
CO 448
MN 275
KY 264
AL 236
NM 207
LA 200
CT 191
UT 189
WA 185
MS 179
GA 137
OK 118
KS 115
IA 102
SD 83
DE 69
WY 66
NE 60
ID 46
RI 44
HI 44
AK 27
ND 18
Name: count, dtype: int64
leadType
bvMandatory 14625
lead 11231
organic 4950
prescreen 1308
rc_returning 137
california 49
instant-offer 8
lionpay 2
repeat 1
express 1
Name: count, dtype: int64
fpStatus
Checked 27049
Rejected 4827
Cancelled 171
NaN 141
Skipped 121
Pending 3
Name: count, dtype: int64
fpymtStatus
Checked 24823
Rejected 4292
None 1639
Pending 1238
Cancelled 198
Skipped 122
Name: count, dtype: int64
cfindvrfy.overallmatchreasoncode
1.0 9392
11.0 8039
16.0 2956
6.0 2487
12.0 1528
...
64.0 2
74.0 1
43.0 1
69.0 1
34.0 1
Name: count, Length: 74, dtype: int64
Identical columns¶
- Value-based cross-checking between columns to find exact duplicates
# Initialize a list to store groups of identical columns
identical_cols = []
# Iterate over each column in the pandas DataFrame
for col in match_df.columns:
# Check if this column has already been checked
if any(col in grp for grp in identical_cols):
continue # Skip the rest of the loop and move to the next column
# Identify columns that contain exactly the same values as 'col' (excluding 'col' itself)
grp = [col] + [other_col for other_col in match_df.columns if other_col != col and match_df[col].equals(match_df[other_col])]
# If there are identical columns, add them to the list
if len(grp) > 1:
identical_cols.append(grp)
# Print the identical columns if found
if identical_cols:
display(Markdown('**Identical columns found:**'))
for grp in identical_cols:
print("-"*30)
print(", ".join(grp))
del identical_cols, col
Identical columns found:
------------------------------
cfind.inputssninvalid, cfind.inputssnrecordedasdeceased, cfind.bestonfilessnissuedatecannotbeverified, cfind.bestonfilessnrecordedasdeceased
------------------------------
underwritingid, clarityFraudId
------------------------------
originated, approved
------------------------------
principal_tot, sum_principal_Checked
------------------------------
fees_tot, sum_fees_Checked
------------------------------
paymentAmount_tot, sum_pymtAmt_Checked
------------------------------
min_days_btw_pymts, sum_fees_Complete, sum_fees_Returned, sum_principal_Complete, sum_principal_Returned, sum_pymtAmt_Complete, sum_pymtAmt_Returned, mean_fees_Complete, mean_fees_Returned, mean_principal_Complete, mean_principal_Returned, mean_pymtAmt_Complete, mean_pymtAmt_Returned, med_fees_Complete, med_fees_Returned, med_principal_Complete, med_principal_Returned, med_pymtAmt_Complete, med_pymtAmt_Returned, min_fees_Complete, min_fees_Returned, min_principal_Complete, min_principal_Returned, min_pymtAmt_Complete, min_pymtAmt_Returned, max_fees_Complete, max_fees_Returned, max_principal_Complete, max_principal_Returned, max_pymtAmt_Complete, max_pymtAmt_Returned
------------------------------
sum_principal_Rejected Awaiting Retry, max_principal_Rejected Awaiting Retry
------------------------------
mean_fees_Rejected Awaiting Retry, med_fees_Rejected Awaiting Retry
------------------------------
mean_principal_Rejected Awaiting Retry, med_principal_Rejected Awaiting Retry
------------------------------
mean_pymtAmt_Rejected Awaiting Retry, med_pymtAmt_Rejected Awaiting Retry
------------------------------
cnt_fees_Cancelled, cnt_principal_Cancelled, cnt_pymtAmt_Cancelled, cnt_pymtStatus_Cancelled
------------------------------
cnt_fees_Checked, cnt_principal_Checked, cnt_pymtAmt_Checked, cnt_pymtStatus_Checked
------------------------------
cnt_fees_Complete, cnt_fees_Returned, cnt_principal_Complete, cnt_principal_Returned, cnt_pymtAmt_Complete, cnt_pymtAmt_Returned, cnt_pymtStatus_Complete, cnt_pymtStatus_Returned, cnt_pymtRCode_R13, cnt_pymtRCode_RXL
------------------------------
cnt_fees_None, cnt_principal_None, cnt_pymtAmt_None, cnt_pymtStatus_None
------------------------------
cnt_fees_Pending, cnt_principal_Pending, cnt_pymtAmt_Pending, cnt_pymtStatus_Pending
------------------------------
cnt_fees_Rejected, cnt_principal_Rejected, cnt_pymtAmt_Rejected, cnt_pymtStatus_Rejected
------------------------------
cnt_fees_Rejected Awaiting Retry, cnt_principal_Rejected Awaiting Retry, cnt_pymtAmt_Rejected Awaiting Retry, cnt_pymtStatus_Rejected Awaiting Retry
------------------------------
cnt_fees_Skipped, cnt_principal_Skipped, cnt_pymtAmt_Skipped, cnt_pymtStatus_Skipped
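The pairwise `.equals()` scan above is O(k²) in the number of columns. An alternative sketch (not the method used in this notebook) transposes the frame and lets `DataFrame.duplicated` flag identical columns in one pass; note that transposing a wide mixed-dtype frame upcasts to object and `duplicated` requires hashable values, so results may differ from `.equals()` in edge cases:

```python
import pandas as pd

# Toy frame with two identical columns (hypothetical data)
df = pd.DataFrame({"a": [1, 2, 3], "b": [1, 2, 3], "c": [9, 8, 7]})

# keep=False marks every member of a duplicate group, not just the later ones
dup_mask = df.T.duplicated(keep=False)
print(list(dup_mask[dup_mask].index))  # ['a', 'b']
```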
# Filter underwriting columns starting with specific prefixes and sort them alphabetically
cols_to_keep = sorted([col for col in match_df.columns if col.startswith(("cfinq", "cfind", "cfindvrfy"))])
# Iterate through the selected columns to compute counts and proportions
for col in cols_to_keep:
# Calculate counts and proportions
val_cnts = match_df[col].value_counts(dropna = False)
prop = (val_cnts / len(match_df)).round(4)*100
# Combine counts and proportions into a pandas DataFrame
summary = pd.DataFrame({"Counts": val_cnts, "Proportions (%)": prop})
print(summary)
print("-"*80)
del cols_to_keep, col, val_cnts, prop, summary
Counts Proportions (%)
cfind.bestonfilessnissuedatecannotbeverified
False 32278 99.89
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.bestonfilessnrecordedasdeceased
False 32278 99.89
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.creditestablishedbeforeage18
False 31953 98.89
True 325 1.01
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.creditestablishedpriortossnissuedate
False 32148 99.49
True 130 0.4
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.currentaddressreportedbynewtradeonly
False 29918 92.59
True 2360 7.3
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.currentaddressreportedbytradeopenlt90days
False 31435 97.29
True 843 2.61
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.driverlicenseformatinvalid
False 25050 77.53
True 3850 11.92
<NA> 3412 10.56
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.driverlicenseinconsistentwithonfile
<NA> 25926 80.24
False 6023 18.64
True 363 1.12
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.highprobabilityssnbelongstoanother
False 32006 99.05
True 272 0.84
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.inputssninvalid
False 32278 99.89
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.inputssnissuedatecannotbeverified
False 32225 99.73
True 53 0.16
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.inputssnrecordedasdeceased
False 32278 99.89
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.inquiryaddresscautious
False 32270 99.87
<NA> 34 0.11
True 8 0.02
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.inquiryaddresshighrisk
False 31897 98.72
True 381 1.18
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.inquiryaddressnonresidential
False 27619 85.48
True 4659 14.42
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.inquiryageyoungerthanssnissuedate
False 32188 99.62
True 90 0.28
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.inquirycurrentaddressnotonfile
False 28752 88.98
True 3526 10.91
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.inquiryonfilecurrentaddressconflict
False 24462 75.71
True 7816 24.19
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.maxnumberofssnswithanybankaccount
1 22149 68.55
2 7797 24.13
3 1071 3.31
4 268 0.83
5 120 0.37
... ... ...
493 1 0.0
689 1 0.0
196 1 0.0
680 1 0.0
144 1 0.0
[357 rows x 2 columns]
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.morethan3inquiriesinthelast30days
False 31058 96.12
True 1220 3.78
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.onfileaddresscautious
False 32276 99.89
<NA> 34 0.11
True 2 0.01
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.onfileaddresshighrisk
False 31950 98.88
True 328 1.02
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.onfileaddressnonresidential
False 30080 93.09
True 2198 6.8
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.ssnreportedmorefrequentlyforanother
False 31930 98.82
True 348 1.08
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.telephonenumberinconsistentwithaddress
True 29708 91.94
False 2570 7.95
<NA> 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.telephonenumberinconsistentwithstate
False 29199 90.37
True 2664 8.24
<NA> 449 1.39
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.totalnumberoffraudindicators
1 11211 34.7
2 9324 28.86
3 5966 18.46
4 3167 9.8
0 1133 3.51
5 1117 3.46
6 299 0.93
7 66 0.2
<NA> 17 0.05
8 12 0.04
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.workphonepreviouslylistedascellphone
<NA> 17476 54.09
False 12435 38.48
True 2401 7.43
--------------------------------------------------------------------------------
Counts Proportions (%)
cfind.workphonepreviouslylistedashomephone
<NA> 17476 54.09
False 13975 43.25
True 861 2.66
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.nameaddressmatch
match 12163 37.64
mismatch 11660 36.09
unavailable 4118 12.74
partial 3624 11.22
invalid 721 2.23
NaN 26 0.08
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.nameaddressreasoncode
NaN 28688 88.78
A8 3624 11.22
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.nameaddressreasoncodedescription
NaN 28688 88.78
(A8) Match to Last Name only 3624 11.22
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.overallmatchreasoncode
1 9392 29.07
11 8039 24.88
16 2956 9.15
6 2487 7.70
12 1528 4.73
... ... ...
24 2 0.01
69 1 0.00
74 1 0.00
34 1 0.00
43 1 0.00
[74 rows x 2 columns]
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.overallmatchresult
partial 22658 70.12
match 9392 29.07
other 173 0.54
mismatch 63 0.19
NaN 26 0.08
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.phonematchresult
unavailable 30829 95.41
match 713 2.21
invalid 444 1.37
partial 158 0.49
mismatch 134 0.41
NaN 34 0.11
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.phonematchtype
M 29725 91.99
U 1072 3.32
NaN 612 1.89
FA 491 1.52
LA 129 0.40
A 126 0.39
F 93 0.29
L 32 0.10
P 32 0.10
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.phonematchtypedescription
(M) Mobile Phone 29725 91.99
(U) Unlisted 1072 3.32
NaN 612 1.89
(FA) Full Name and Address 491 1.52
(LA) Last Name and Address 129 0.40
(A) Address Only 126 0.39
(F) Full Name Only 93 0.29
(L) Last Name Only 32 0.10
(P) Pager 32 0.10
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.phonetype
NaN 31307 96.89
R 950 2.94
B 43 0.13
MU 12 0.04
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.ssndobmatch
match 25838 79.96
partial 4485 13.88
invalid 1573 4.87
mismatch 324 1.00
unavailable 66 0.20
NaN 26 0.08
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.ssndobreasoncode
NaN 26469 81.92
D07 2894 8.96
D04 1358 4.20
D03 812 2.51
D01 593 1.84
D02 164 0.51
D06 22 0.07
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.ssnnamematch
match 28876 89.37
partial 2138 6.62
mismatch 1057 3.27
unavailable 207 0.64
NaN 26 0.08
invalid 8 0.02
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.ssnnamereasoncode
NaN 30551 94.55
S03 1576 4.88
S07 185 0.57
--------------------------------------------------------------------------------
Counts Proportions (%)
cfindvrfy.ssnnamereasoncodedescription
NaN 30551 94.55
(S03) SSN match to address only 1576 4.88
(S07) SSN Match to last name only 185 0.57
--------------------------------------------------------------------------------
Counts Proportions (%)
cfinq.fifteendaysago
3 8349 25.84
4 4498 13.92
5 3926 12.15
6 2794 8.65
2 2518 7.79
... ... ...
60 1 0.0
68 1 0.0
50 1 0.0
65 1 0.0
<NA> 1 0.0
[63 rows x 2 columns]
--------------------------------------------------------------------------------
Counts Proportions (%)
cfinq.ninetydaysago
3 4536 14.04
5 3200 9.9
4 3133 9.7
6 2706 8.37
7 2057 6.37
... ... ...
94 1 0.0
83 1 0.0
113 1 0.0
100 1 0.0
<NA> 1 0.0
[115 rows x 2 columns]
--------------------------------------------------------------------------------
Counts Proportions (%)
cfinq.onehourago
3 11289 34.94
2 4483 13.87
4 4359 13.49
5 3362 10.4
1 2959 9.16
6 2048 6.34
7 1114 3.45
8 774 2.4
9 517 1.6
10 359 1.11
11 245 0.76
12 182 0.56
13 139 0.43
14 106 0.33
15 100 0.31
16 62 0.19
17 50 0.15
19 30 0.09
18 29 0.09
20 23 0.07
22 20 0.06
21 12 0.04
25 10 0.03
24 10 0.03
23 6 0.02
27 6 0.02
26 5 0.02
0 3 0.01
29 2 0.01
31 2 0.01
33 2 0.01
32 1 0.0
35 1 0.0
28 1 0.0
<NA> 1 0.0
--------------------------------------------------------------------------------
Counts Proportions (%)
cfinq.oneminuteago
1 14639 45.31
3 11323 35.04
4 2525 7.81
2 1453 4.5
5 1443 4.47
6 650 2.01
7 144 0.45
8 62 0.19
9 34 0.11
10 22 0.07
11 8 0.02
12 4 0.01
0 3 0.01
14 1 0.0
<NA> 1 0.0
--------------------------------------------------------------------------------
Counts Proportions (%)
cfinq.sevendaysago
3 9575 29.63
4 4666 14.44
5 3849 11.91
2 3184 9.85
6 2679 8.29
7 1798 5.56
8 1353 4.19
9 1022 3.16
10 763 2.36
11 576 1.78
1 462 1.43
12 429 1.33
13 355 1.1
14 273 0.84
15 245 0.76
16 167 0.52
17 166 0.51
18 129 0.4
19 91 0.28
20 79 0.24
21 75 0.23
22 63 0.19
23 46 0.14
24 37 0.11
25 32 0.1
26 27 0.08
28 24 0.07
27 23 0.07
29 21 0.06
32 15 0.05
30 12 0.04
31 11 0.03
33 9 0.03
36 8 0.02
34 8 0.02
35 7 0.02
55 4 0.01
0 3 0.01
38 3 0.01
37 3 0.01
48 2 0.01
42 2 0.01
39 2 0.01
47 2 0.01
54 2 0.01
41 1 0.0
58 1 0.0
44 1 0.0
57 1 0.0
40 1 0.0
64 1 0.0
43 1 0.0
49 1 0.0
63 1 0.0
<NA> 1 0.0
--------------------------------------------------------------------------------
Counts Proportions (%)
cfinq.tenminutesago
3 11543 35.72
1 6322 19.57
2 4390 13.59
4 3769 11.66
5 2660 8.23
6 1584 4.9
7 695 2.15
8 477 1.48
9 273 0.84
10 185 0.57
11 116 0.36
12 89 0.28
13 51 0.16
14 41 0.13
15 34 0.11
16 27 0.08
18 12 0.04
17 12 0.04
20 7 0.02
19 6 0.02
21 4 0.01
22 4 0.01
23 3 0.01
0 3 0.01
35 1 0.0
25 1 0.0
32 1 0.0
27 1 0.0
<NA> 1 0.0
--------------------------------------------------------------------------------
Counts Proportions (%)
cfinq.thirtydaysago
3 6930 21.45
4 4110 12.72
5 3826 11.84
6 2826 8.75
7 2088 6.46
... ... ...
59 1 0.0
77 1 0.0
73 1 0.0
76 1 0.0
<NA> 1 0.0
[76 rows x 2 columns]
--------------------------------------------------------------------------------
Counts Proportions (%)
cfinq.threesixtyfivedaysago
3 2256 6.98
5 2031 6.29
4 1983 6.14
6 1952 6.04
7 1628 5.04
... ... ...
326 1 0.0
280 1 0.0
160 1 0.0
279 1 0.0
<NA> 1 0.0
[213 rows x 2 columns]
--------------------------------------------------------------------------------
Counts Proportions (%)
cfinq.twentyfourhoursago
3 10730 33.21
4 4616 14.29
2 4105 12.7
5 3664 11.34
6 2360 7.3
7 1418 4.39
1 1348 4.17
8 1028 3.18
9 729 2.26
10 537 1.66
11 372 1.15
12 280 0.87
13 231 0.71
14 163 0.5
15 160 0.5
17 100 0.31
16 99 0.31
18 66 0.2
19 60 0.19
20 42 0.13
21 34 0.11
22 31 0.1
23 22 0.07
27 19 0.06
25 18 0.06
24 17 0.05
26 14 0.04
29 8 0.02
33 5 0.02
32 5 0.02
31 5 0.02
28 4 0.01
30 4 0.01
34 3 0.01
36 3 0.01
0 3 0.01
48 1 0.0
60 1 0.0
35 1 0.0
57 1 0.0
39 1 0.0
44 1 0.0
58 1 0.0
41 1 0.0
<NA> 1 0.0
--------------------------------------------------------------------------------
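The per-column frequency tables above all follow one pattern: value counts alongside percentage proportions, with missing values retained as their own `<NA>`/`NaN` row. A minimal sketch of such a helper (the name `counts_props` is an assumption; the notebook's actual implementation is not shown in this section):

```python
import pandas as pd

def counts_props(s: pd.Series) -> pd.DataFrame:
    """Value counts with NaN kept as a category, plus proportions in percent."""
    counts = s.value_counts(dropna=False)            # keep <NA>/NaN as a row
    props = (counts / counts.sum() * 100).round(2)   # percentage of all rows
    return pd.concat([counts, props], axis=1,
                     keys=["Counts", "Proportions (%)"])
```

Applied column by column (e.g. `counts_props(match_df["cfind.totalnumberoffraudindicators"])`), this reproduces the layout of the tables above.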
"""
pd.crosstab(loan_df["originated"], loan_df["approved"], dropna = False, margins = True)
pd.crosstab(match_df["originated"], match_df["approved"], dropna = False, margins = True)
"""
"""
pd.crosstab(cuv_df["cfind.inputssninvalid"], cuv_df["cfind.bestonfilessnissuedatecannotbeverified"], dropna = False, margins = True)
pd.crosstab(match_df["cfind.inputssninvalid"], match_df["cfind.bestonfilessnissuedatecannotbeverified"], dropna = False, margins = True)
"""
"""
pd.crosstab(cuv_df["cfind.bestonfilessnrecordedasdeceased"], cuv_df["cfind.bestonfilessnissuedatecannotbeverified"], dropna = False, margins = True)
pd.crosstab(match_df["cfind.bestonfilessnrecordedasdeceased"], match_df["cfind.bestonfilessnissuedatecannotbeverified"], dropna = False, margins = True)
pd.crosstab(cuv_df["cfind.bestonfilessnrecordedasdeceased"], cuv_df["cfind.inputssnrecordedasdeceased"], dropna = False, margins = True)
pd.crosstab(match_df["cfind.bestonfilessnrecordedasdeceased"], match_df["cfind.inputssnrecordedasdeceased"], dropna = False, margins = True)
pd.crosstab(cuv_df["cfind.bestonfilessnissuedatecannotbeverified"], cuv_df["cfind.inputssnrecordedasdeceased"], dropna = False, margins = True)
pd.crosstab(match_df["cfind.bestonfilessnissuedatecannotbeverified"], match_df["cfind.inputssnrecordedasdeceased"], dropna = False, margins = True)
""";
"""
tbl = pd.crosstab(index = [match_df["cfind.bestonfilessnrecordedasdeceased"], match_df["cfind.inputssnrecordedasdeceased"]],
columns = match_df["cfind.bestonfilessnissuedatecannotbeverified"], dropna = False, margins = True)
# Rename the columns to add "cfind.bestonfilessnissuedatecannotbeverified" as the header above the values
tbl.columns = pd.MultiIndex.from_tuples([("cfind.bestonfilessnissuedatecannotbeverified", "False"),
("cfind.bestonfilessnissuedatecannotbeverified", "NaN"),
("cfind.bestonfilessnissuedatecannotbeverified", "All")])
# Display the table with "cfind.bestonfilessnissuedatecannotbeverified" centered
tbl.columns.names = [None, None] # Remove the column name for the first level
# Display the table with column names at the top
print(tbl.to_string(header = True, index = True))
del tbl
""";
phonematchtype¶
pd.crosstab(cuv_df["cfindvrfy.phonematchtypedescription"].fillna("NaN"),
cuv_df["cfindvrfy.phonematchtype"].fillna("NaN"),
dropna = False,
margins = True)
| cfindvrfy.phonematchtype | A | F | FA | L | LA | M | NaN | P | U | All |
|---|---|---|---|---|---|---|---|---|---|---|
| cfindvrfy.phonematchtypedescription | ||||||||||
| (A) Address Only | 189 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 189 |
| (F) Full Name Only | 0 | 130 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 130 |
| (FA) Full Name and Address | 0 | 0 | 729 | 0 | 0 | 0 | 0 | 0 | 0 | 729 |
| (L) Last Name Only | 0 | 0 | 0 | 50 | 0 | 0 | 0 | 0 | 0 | 50 |
| (LA) Last Name and Address | 0 | 0 | 0 | 0 | 191 | 0 | 0 | 0 | 0 | 191 |
| (M) Mobile Phone | 0 | 0 | 0 | 0 | 0 | 45846 | 0 | 0 | 0 | 45846 |
| (P) Pager | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 47 | 0 | 47 |
| (U) Unlisted | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1617 | 1617 |
| NaN | 0 | 0 | 0 | 0 | 0 | 0 | 953 | 0 | 0 | 953 |
| All | 189 | 130 | 729 | 50 | 191 | 45846 | 953 | 47 | 1617 | 49752 |
ssnnamereasoncode¶
pd.crosstab(cuv_df["cfindvrfy.ssnnamereasoncodedescription"].fillna("NaN"),
cuv_df["cfindvrfy.ssnnamereasoncode"].fillna("NaN"),
dropna = False,
margins = True)
| cfindvrfy.ssnnamereasoncode | NaN | S03 | S07 | All |
|---|---|---|---|---|
| cfindvrfy.ssnnamereasoncodedescription | ||||
| (S03) SSN match to address only | 0 | 2374 | 0 | 2374 |
| (S07) SSN Match to last name only | 0 | 0 | 295 | 295 |
| NaN | 47083 | 0 | 0 | 47083 |
| All | 47083 | 2374 | 295 | 49752 |
nameaddressreasoncode¶
pd.crosstab(cuv_df["cfindvrfy.nameaddressreasoncodedescription"].fillna("NaN"),
cuv_df["cfindvrfy.nameaddressreasoncode"].fillna("NaN"),
dropna = False,
margins = True)
| cfindvrfy.nameaddressreasoncode | A8 | NaN | All |
|---|---|---|---|
| cfindvrfy.nameaddressreasoncodedescription | |||
| (A8) Match to Last Name only | 5627 | 0 | 5627 |
| NaN | 0 | 44125 | 44125 |
| All | 5627 | 44125 | 49752 |
Drop columns¶
- Drop columns that the crosstabs above show map one-to-one to their description counterparts, i.e. phonematchtype, ssnnamereasoncode, nameaddressreasoncode
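Before dropping, the one-to-one correspondence suggested by the crosstabs can be checked programmatically rather than by eye. A hedged sketch (the helper name `is_one_to_one` is illustrative, not from the notebook):

```python
import pandas as pd

def is_one_to_one(df: pd.DataFrame, col_a: str, col_b: str) -> bool:
    """True if every value of col_a maps to exactly one value of col_b, and vice versa."""
    pair = df[[col_a, col_b]].astype("object").fillna("NaN")  # keep NaN as its own level
    return bool(pair.groupby(col_a)[col_b].nunique().le(1).all()
                and pair.groupby(col_b)[col_a].nunique().le(1).all())

# e.g. confirm a code/description pair is redundant before dropping one side:
# if is_one_to_one(match_df, "cfindvrfy.phonematchtype",
#                  "cfindvrfy.phonematchtypedescription"):
#     match_df = match_df.drop(columns=["cfindvrfy.phonematchtype"])
```

With only off-diagonal zeros in each crosstab (as above), either member of the pair carries the full information, so one can be dropped without loss.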
anal_df(match_df)
DataFrame Overview
- First 5 entries:
[Wide-table preview truncated: first 5 rows of match_df (311 columns), spanning Clarity fraud inquiry counts (cfinq.*), fraud indicators (cfind.*), verification results (cfindvrfy.*), loan attributes (loanId, apr, loanStatus, loanAmount, state, leadType, fpStatus, ...), and aggregated payment features (sum/mean/med/std/cnt/min/max of fees, principal, and payment amounts per payment status, cnt_pymtRCode_* return-code counts, and fpymtDate/fpymtAmt/fpymtStatus).]
- 0 duplicate rows.
- 32312 entries, 311 columns.
- Check missing values and data types:
Missing Values (n) Proportion (%) Dtype
cfindvrfy.phonetype 31307 96.889700 category
cfindvrfy.ssnnamereasoncodedescription 30551 94.550012 category
cfindvrfy.ssnnamereasoncode 30551 94.550012 category
cfindvrfy.nameaddressreasoncodedescription 28688 88.784353 category
cfindvrfy.nameaddressreasoncode 28688 88.784353 category
cfindvrfy.ssndobreasoncode 26469 81.916935 category
cfind.driverlicenseinconsistentwithonfile 25926 80.236445 boolean
cfind.workphonepreviouslylistedascellphone 17476 54.085170 boolean
cfind.workphonepreviouslylistedashomephone 17476 54.085170 boolean
cnt_fees_Cancelled 6395 19.791409 Int32
cnt_fees_Checked 6395 19.791409 Int32
std_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
std_pymtAmt_Skipped 6395 19.791409 float64
cnt_fees_Complete 6395 19.791409 Int32
cnt_fees_None 6395 19.791409 Int32
cnt_fees_Pending 6395 19.791409 Int32
cnt_fees_Rejected 6395 19.791409 Int32
cnt_fees_Rejected Awaiting Retry 6395 19.791409 Int32
cnt_fees_Returned 6395 19.791409 Int32
cnt_fees_Skipped 6395 19.791409 Int32
med_pymtAmt_Cancelled 6395 19.791409 float64
cnt_principal_Checked 6395 19.791409 Int32
cnt_pymtAmt_Complete 6395 19.791409 Int32
min_fees_Cancelled 6395 19.791409 float64
cnt_pymtAmt_Skipped 6395 19.791409 Int32
cnt_pymtAmt_Returned 6395 19.791409 Int32
cnt_pymtAmt_Rejected Awaiting Retry 6395 19.791409 Int32
cnt_pymtAmt_Rejected 6395 19.791409 Int32
cnt_pymtAmt_Pending 6395 19.791409 Int32
cnt_pymtAmt_None 6395 19.791409 Int32
cnt_pymtAmt_Checked 6395 19.791409 Int32
cnt_principal_Complete 6395 19.791409 Int32
cnt_pymtAmt_Cancelled 6395 19.791409 Int32
cnt_principal_Skipped 6395 19.791409 Int32
cnt_principal_Returned 6395 19.791409 Int32
cnt_principal_Rejected Awaiting Retry 6395 19.791409 Int32
cnt_principal_Rejected 6395 19.791409 Int32
cnt_principal_Pending 6395 19.791409 Int32
cnt_principal_None 6395 19.791409 Int32
cnt_principal_Cancelled 6395 19.791409 Int32
std_pymtAmt_Pending 6395 19.791409 float64
std_pymtAmt_Rejected 6395 19.791409 float64
min_fees_Complete 6395 19.791409 float64
med_pymtAmt_Returned 6395 19.791409 float64
med_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
med_pymtAmt_Rejected 6395 19.791409 float64
med_pymtAmt_Pending 6395 19.791409 float64
med_pymtAmt_None 6395 19.791409 float64
med_pymtAmt_Complete 6395 19.791409 float64
med_pymtAmt_Checked 6395 19.791409 float64
cnt_pymtRCode_MISSED 6395 19.791409 Int32
med_principal_Skipped 6395 19.791409 float64
med_principal_Returned 6395 19.791409 float64
med_principal_Rejected Awaiting Retry 6395 19.791409 float64
med_principal_Rejected 6395 19.791409 float64
med_principal_Pending 6395 19.791409 float64
med_principal_None 6395 19.791409 float64
med_principal_Complete 6395 19.791409 float64
med_pymtAmt_Skipped 6395 19.791409 float64
std_fees_Cancelled 6395 19.791409 float64
std_fees_Checked 6395 19.791409 float64
std_principal_Pending 6395 19.791409 float64
std_pymtAmt_None 6395 19.791409 float64
std_pymtAmt_Checked 6395 19.791409 float64
std_pymtAmt_Cancelled 6395 19.791409 float64
std_principal_Skipped 6395 19.791409 float64
std_principal_Rejected Awaiting Retry 6395 19.791409 float64
std_principal_Rejected 6395 19.791409 float64
std_principal_None 6395 19.791409 float64
std_fees_None 6395 19.791409 float64
std_principal_Checked 6395 19.791409 float64
std_principal_Cancelled 6395 19.791409 float64
std_fees_Skipped 6395 19.791409 float64
std_fees_Rejected Awaiting Retry 6395 19.791409 float64
std_fees_Rejected 6395 19.791409 float64
std_fees_Pending 6395 19.791409 float64
min_fees_Checked 6395 19.791409 float64
min_fees_Returned 6395 19.791409 float64
min_fees_None 6395 19.791409 float64
max_pymtAmt_Skipped 6395 19.791409 float64
max_principal_None 6395 19.791409 float64
max_principal_Pending 6395 19.791409 float64
max_principal_Rejected 6395 19.791409 float64
max_principal_Rejected Awaiting Retry 6395 19.791409 float64
max_principal_Returned 6395 19.791409 float64
max_principal_Skipped 6395 19.791409 float64
max_pymtAmt_Cancelled 6395 19.791409 float64
max_pymtAmt_Checked 6395 19.791409 float64
max_pymtAmt_Complete 6395 19.791409 float64
max_pymtAmt_None 6395 19.791409 float64
max_pymtAmt_Pending 6395 19.791409 float64
max_pymtAmt_Rejected 6395 19.791409 float64
max_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
max_pymtAmt_Returned 6395 19.791409 float64
cnt_custom 6395 19.791409 Int32
max_principal_Checked 6395 19.791409 float64
cnt_non custom 6395 19.791409 Int32
cnt_pymtStatus_Cancelled 6395 19.791409 Int32
cnt_pymtStatus_Checked 6395 19.791409 Int32
cnt_pymtStatus_Complete 6395 19.791409 Int32
cnt_pymtStatus_None 6395 19.791409 Int32
cnt_pymtStatus_Pending 6395 19.791409 Int32
cnt_pymtStatus_Rejected 6395 19.791409 Int32
cnt_pymtStatus_Rejected Awaiting Retry 6395 19.791409 Int32
cnt_pymtStatus_Returned 6395 19.791409 Int32
cnt_pymtStatus_Skipped 6395 19.791409 Int32
cnt_pymtRCode_C01 6395 19.791409 Int32
cnt_pymtRCode_C02 6395 19.791409 Int32
cnt_pymtRCode_C03 6395 19.791409 Int32
cnt_pymtRCode_C05 6395 19.791409 Int32
max_principal_Complete 6395 19.791409 float64
max_principal_Cancelled 6395 19.791409 float64
min_fees_Pending 6395 19.791409 float64
min_pymtAmt_Checked 6395 19.791409 float64
min_fees_Rejected 6395 19.791409 float64
min_fees_Rejected Awaiting Retry 6395 19.791409 float64
med_principal_Cancelled 6395 19.791409 float64
min_fees_Skipped 6395 19.791409 float64
min_principal_Cancelled 6395 19.791409 float64
min_principal_Checked 6395 19.791409 float64
min_principal_Complete 6395 19.791409 float64
min_principal_None 6395 19.791409 float64
min_principal_Pending 6395 19.791409 float64
min_principal_Rejected 6395 19.791409 float64
min_principal_Rejected Awaiting Retry 6395 19.791409 float64
min_principal_Returned 6395 19.791409 float64
min_principal_Skipped 6395 19.791409 float64
min_pymtAmt_Cancelled 6395 19.791409 float64
min_pymtAmt_Complete 6395 19.791409 float64
max_fees_Skipped 6395 19.791409 float64
min_pymtAmt_None 6395 19.791409 float64
min_pymtAmt_Pending 6395 19.791409 float64
min_pymtAmt_Rejected 6395 19.791409 float64
min_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
min_pymtAmt_Returned 6395 19.791409 float64
min_pymtAmt_Skipped 6395 19.791409 float64
max_fees_Cancelled 6395 19.791409 float64
max_fees_Checked 6395 19.791409 float64
max_fees_Complete 6395 19.791409 float64
max_fees_None 6395 19.791409 float64
max_fees_Pending 6395 19.791409 float64
max_fees_Rejected 6395 19.791409 float64
max_fees_Rejected Awaiting Retry 6395 19.791409 float64
max_fees_Returned 6395 19.791409 float64
med_principal_Checked 6395 19.791409 float64
med_fees_Pending 6395 19.791409 float64
med_fees_Skipped 6395 19.791409 float64
cnt_pymtRCode_R01 6395 19.791409 Int32
fees_tot 6395 19.791409 float64
paymentAmount_tot 6395 19.791409 float64
sum_days_btw_pymts 6395 19.791409 float64
med_fees_Returned 6395 19.791409 float64
med_days_btw_pymts 6395 19.791409 float64
std_days_btw_pymts 6395 19.791409 float64
cnt_days_btw_pymts 6395 19.791409 Int32
min_days_btw_pymts 6395 19.791409 float64
max_days_btw_pymts 6395 19.791409 float64
sum_fees_Cancelled 6395 19.791409 float64
sum_fees_Checked 6395 19.791409 float64
sum_fees_Complete 6395 19.791409 float64
sum_fees_None 6395 19.791409 float64
sum_fees_Pending 6395 19.791409 float64
sum_fees_Rejected 6395 19.791409 float64
sum_fees_Rejected Awaiting Retry 6395 19.791409 float64
sum_fees_Returned 6395 19.791409 float64
sum_fees_Skipped 6395 19.791409 float64
sum_principal_Cancelled 6395 19.791409 float64
sum_principal_Checked 6395 19.791409 float64
sum_principal_Complete 6395 19.791409 float64
principal_tot 6395 19.791409 float64
cnt_pymtRCode_R02 6395 19.791409 Int32
sum_principal_Pending 6395 19.791409 float64
cnt_pymtRCode_R03 6395 19.791409 Int32
cnt_pymtRCode_RXS 6395 19.791409 Int32
cnt_pymtRCode_RXL 6395 19.791409 Int32
cnt_pymtRCode_RWC 6395 19.791409 Int32
cnt_pymtRCode_RUP 6395 19.791409 Int32
cnt_pymtRCode_RIR 6395 19.791409 Int32
cnt_pymtRCode_RFG 6395 19.791409 Int32
cnt_pymtRCode_RBW 6395 19.791409 Int32
cnt_pymtRCode_RAF 6395 19.791409 Int32
cnt_pymtRCode_R99 6395 19.791409 Int32
cnt_pymtRCode_R29 6395 19.791409 Int32
cnt_pymtRCode_R20 6395 19.791409 Int32
cnt_pymtRCode_R19 6395 19.791409 Int32
cnt_pymtRCode_R16 6395 19.791409 Int32
cnt_pymtRCode_R15 6395 19.791409 Int32
cnt_pymtRCode_R13 6395 19.791409 Int32
cnt_pymtRCode_R10 6395 19.791409 Int32
cnt_pymtRCode_R09 6395 19.791409 Int32
cnt_pymtRCode_R08 6395 19.791409 Int32
cnt_pymtRCode_R07 6395 19.791409 Int32
cnt_pymtRCode_R06 6395 19.791409 Int32
cnt_pymtRCode_R04 6395 19.791409 Int32
sum_principal_None 6395 19.791409 float64
mean_days_btw_pymts 6395 19.791409 float64
sum_principal_Rejected 6395 19.791409 float64
mean_principal_None 6395 19.791409 float64
mean_principal_Rejected 6395 19.791409 float64
mean_principal_Rejected Awaiting Retry 6395 19.791409 float64
mean_principal_Returned 6395 19.791409 float64
mean_principal_Skipped 6395 19.791409 float64
mean_pymtAmt_Cancelled 6395 19.791409 float64
mean_pymtAmt_Checked 6395 19.791409 float64
mean_pymtAmt_Complete 6395 19.791409 float64
mean_pymtAmt_None 6395 19.791409 float64
mean_pymtAmt_Pending 6395 19.791409 float64
mean_pymtAmt_Rejected 6395 19.791409 float64
mean_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
mean_pymtAmt_Returned 6395 19.791409 float64
mean_pymtAmt_Skipped 6395 19.791409 float64
sum_principal_Rejected Awaiting Retry 6395 19.791409 float64
med_fees_Cancelled 6395 19.791409 float64
med_fees_Checked 6395 19.791409 float64
med_fees_Complete 6395 19.791409 float64
med_fees_None 6395 19.791409 float64
cnt_pymtRCode_LPP01 6395 19.791409 Int32
med_fees_Rejected 6395 19.791409 float64
med_fees_Rejected Awaiting Retry 6395 19.791409 float64
mean_principal_Pending 6395 19.791409 float64
cnt_pymtRCode_C07 6395 19.791409 Int32
mean_fees_Cancelled 6395 19.791409 float64
mean_fees_Complete 6395 19.791409 float64
sum_pymtAmt_None 6395 19.791409 float64
sum_pymtAmt_Complete 6395 19.791409 float64
sum_pymtAmt_Checked 6395 19.791409 float64
sum_pymtAmt_Cancelled 6395 19.791409 float64
sum_pymtAmt_Rejected 6395 19.791409 float64
sum_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
sum_pymtAmt_Returned 6395 19.791409 float64
sum_pymtAmt_Skipped 6395 19.791409 float64
mean_principal_Complete 6395 19.791409 float64
mean_fees_Checked 6395 19.791409 float64
mean_fees_None 6395 19.791409 float64
sum_principal_Skipped 6395 19.791409 float64
sum_principal_Returned 6395 19.791409 float64
mean_fees_Pending 6395 19.791409 float64
mean_fees_Rejected 6395 19.791409 float64
mean_fees_Rejected Awaiting Retry 6395 19.791409 float64
mean_fees_Returned 6395 19.791409 float64
mean_fees_Skipped 6395 19.791409 float64
mean_principal_Cancelled 6395 19.791409 float64
mean_principal_Checked 6395 19.791409 float64
sum_pymtAmt_Pending 6395 19.791409 float64
cfind.driverlicenseformatinvalid 3412 10.559544 boolean
cfindvrfy.phonematchtypedescription 612 1.894033 category
cfindvrfy.phonematchtype 612 1.894033 category
cfind.telephonenumberinconsistentwithstate 449 1.389577 boolean
fpStatus 141 0.436370 category
clearfraudscore 93 0.287819 float64
cfind.currentaddressreportedbynewtradeonly 34 0.105224 boolean
cfind.inputssnissuedatecannotbeverified 34 0.105224 boolean
cfindvrfy.phonematchresult 34 0.105224 category
cfind.inquiryonfilecurrentaddressconflict 34 0.105224 boolean
cfind.telephonenumberinconsistentwithaddress 34 0.105224 boolean
cfind.inquiryageyoungerthanssnissuedate 34 0.105224 boolean
cfind.onfileaddresscautious 34 0.105224 boolean
cfind.inquiryaddressnonresidential 34 0.105224 boolean
cfind.onfileaddresshighrisk 34 0.105224 boolean
cfind.creditestablishedbeforeage18 34 0.105224 boolean
cfind.currentaddressreportedbytradeopenlt90days 34 0.105224 boolean
cfind.inputssninvalid 34 0.105224 boolean
cfind.ssnreportedmorefrequentlyforanother 34 0.105224 boolean
cfind.inquiryaddresscautious 34 0.105224 boolean
cfind.inquiryaddresshighrisk 34 0.105224 boolean
cfind.highprobabilityssnbelongstoanother 34 0.105224 boolean
cfind.morethan3inquiriesinthelast30days 34 0.105224 boolean
cfind.bestonfilessnissuedatecannotbeverified 34 0.105224 boolean
cfind.inquirycurrentaddressnotonfile 34 0.105224 boolean
cfind.bestonfilessnrecordedasdeceased 34 0.105224 boolean
cfind.inputssnrecordedasdeceased 34 0.105224 boolean
cfind.creditestablishedpriortossnissuedate 34 0.105224 boolean
cfind.onfileaddressnonresidential 34 0.105224 boolean
cfindvrfy.ssnnamematch 26 0.080465 category
cfindvrfy.nameaddressmatch 26 0.080465 category
cfindvrfy.overallmatchresult 26 0.080465 category
cfindvrfy.ssndobmatch 26 0.080465 category
cfindvrfy.overallmatchreasoncode 26 0.080465 category
originatedDate 18 0.055707 datetime64[ns]
cfind.maxnumberofssnswithanybankaccount 17 0.052612 Int32
cfind.totalnumberoffraudindicators 17 0.052612 Int32
nPaidOff 2 0.006190 Int32
cfinq.thirtydaysago 1 0.003095 Int32
cfinq.twentyfourhoursago 1 0.003095 Int32
cfinq.oneminuteago 1 0.003095 Int32
cfinq.onehourago 1 0.003095 Int32
cfinq.ninetydaysago 1 0.003095 Int32
cfinq.sevendaysago 1 0.003095 Int32
cfinq.tenminutesago 1 0.003095 Int32
cfinq.fifteendaysago 1 0.003095 Int32
cfinq.threesixtyfivedaysago 1 0.003095 Int32
hasCF 0 0.000000 boolean
fpymtAmt 0 0.000000 float64
fpymtDate 0 0.000000 datetime64[ns]
underwritingid 0 0.000000 object
loanId 0 0.000000 object
anon_ssn 0 0.000000 object
payFrequency 0 0.000000 category
apr 0 0.000000 float64
applicationDate 0 0.000000 datetime64[ns]
originated 0 0.000000 boolean
approved 0 0.000000 boolean
isFunded 0 0.000000 boolean
loanStatus 0 0.000000 category
loanAmount 0 0.000000 float64
originallyScheduledPaymentAmount 0 0.000000 float64
state 0 0.000000 category
leadType 0 0.000000 category
leadCost 0 0.000000 float64
clarityFraudId 0 0.000000 object
fpymtStatus 0 0.000000 category
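The largest missingness block above, 6395 rows (19.79%) absent across every payment aggregate, most plausibly corresponds to loans with no matched ACH payment records rather than scattered gaps, since the same count repeats for all `sum_*`, `mean_*`, `med_*`, `std_*`, `min_*`, `max_*` and `cnt_*` payment columns. A minimal sketch (using a hypothetical `missing_summary` helper; the notebook's own `anal_df` may differ) of how a summary in this shape can be computed:

```python
import numpy as np
import pandas as pd

def missing_summary(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column NaN count, percentage of rows, and dtype, sorted descending."""
    out = pd.DataFrame({
        "Missing Values (n)": df.isna().sum(),
        "Proportion (%)": df.isna().mean() * 100,
        "Dtype": df.dtypes.astype(str),
    })
    return out.sort_values("Missing Values (n)", ascending=False)

# Illustrative data only: 2 of 4 clearfraudscore values missing -> 50%.
demo = pd.DataFrame({
    "clearfraudscore": [700.0, np.nan, 650.0, np.nan],
    "loanAmount": [500.0, 1000.0, 750.0, 600.0],
})
print(missing_summary(demo))
```

Columns sharing an identical (n, %) pair, like the 6395-row block, hint at a single common cause and can be investigated together, e.g. by checking whether their missingness masks coincide row by row.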
cols_to_drop = (
    # Free-text *description columns duplicate their *reasoncode/*matchtype counterparts
    ["cfindvrfy.phonematchtypedescription", "cfindvrfy.ssnnamereasoncodedescription", "cfindvrfy.nameaddressreasoncodedescription"]
    # Values in cnt_<field>_<status> columns are the same as the cnt_pymtStatus_<status> columns
    + [col for col in match_df.columns if col.startswith(("cnt_fees_", "cnt_principal_", "cnt_pymtAmt_"))]
)
clean_df = match_df.drop(columns=cols_to_drop)
del match_df  # release the unreduced frame
gc.collect()
anal_df(clean_df)
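The drop above rests on the claim that `cnt_fees_<status>`, `cnt_principal_<status>` and `cnt_pymtAmt_<status>` carry the same values as `cnt_pymtStatus_<status>` (each simply counts payment rows per status). A hedged sketch, on toy data only, of how that redundancy can be verified with `Series.equals` before discarding the columns:

```python
import pandas as pd

# Toy frame mimicking the notebook's nullable-Int32 count columns.
df = pd.DataFrame({
    "cnt_pymtStatus_Checked": pd.array([3, 0, 5], dtype="Int32"),
    "cnt_fees_Checked":       pd.array([3, 0, 5], dtype="Int32"),
    "cnt_principal_Checked":  pd.array([3, 0, 5], dtype="Int32"),
    "cnt_pymtAmt_Checked":    pd.array([3, 0, 5], dtype="Int32"),
})

for prefix in ("cnt_fees_", "cnt_principal_", "cnt_pymtAmt_"):
    for col in [c for c in df.columns if c.startswith(prefix)]:
        status_col = "cnt_pymtStatus_" + col[len(prefix):]
        # Series.equals treats <NA> == <NA> as equal, unlike elementwise ==
        assert df[col].equals(df[status_col]), f"{col} differs from {status_col}"
print("count columns verified as redundant")
```

Running such a check on `match_df` before the drop turns the comment's assertion into a guarantee; a failing column would surface immediately via the assertion message.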
DataFrame Overview
- First 5 entries:
(Wide preview truncated: each of the 5 rows spans the full 281 columns, covering Clarity inquiry counts (cfinq.*), fraud indicators (cfind.*), verification match results (cfindvrfy.*), loan attributes (underwritingid, loanId, payFrequency, apr, applicationDate, loanStatus, loanAmount, state, leadType, leadCost), per-status payment aggregates (sum/mean/med/std/min/max of fees, principal and paymentAmount, plus cnt_pymtStatus_* and cnt_pymtRCode_* counts), and first-payment fields (fpymtDate, fpymtAmt, fpymtStatus).)
- 0 duplicate rows.
- 32312 entries, 281 columns.
- Check missing values and data types:
Missing Values (n) Proportion (%) Dtype
cfindvrfy.phonetype 31307 96.889700 category
cfindvrfy.ssnnamereasoncode 30551 94.550012 category
cfindvrfy.nameaddressreasoncode 28688 88.784353 category
cfindvrfy.ssndobreasoncode 26469 81.916935 category
cfind.driverlicenseinconsistentwithonfile 25926 80.236445 boolean
cfind.workphonepreviouslylistedascellphone 17476 54.085170 boolean
cfind.workphonepreviouslylistedashomephone 17476 54.085170 boolean
std_principal_Rejected 6395 19.791409 float64
std_principal_Pending 6395 19.791409 float64
std_principal_None 6395 19.791409 float64
std_principal_Checked 6395 19.791409 float64
min_principal_Cancelled 6395 19.791409 float64
std_principal_Cancelled 6395 19.791409 float64
std_fees_Skipped 6395 19.791409 float64
std_fees_Rejected Awaiting Retry 6395 19.791409 float64
std_fees_Rejected 6395 19.791409 float64
std_principal_Rejected Awaiting Retry 6395 19.791409 float64
med_fees_Rejected Awaiting Retry 6395 19.791409 float64
std_pymtAmt_Cancelled 6395 19.791409 float64
std_pymtAmt_Checked 6395 19.791409 float64
std_pymtAmt_None 6395 19.791409 float64
std_pymtAmt_Pending 6395 19.791409 float64
std_pymtAmt_Rejected 6395 19.791409 float64
std_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
std_pymtAmt_Skipped 6395 19.791409 float64
min_fees_Cancelled 6395 19.791409 float64
min_fees_Checked 6395 19.791409 float64
min_fees_Complete 6395 19.791409 float64
min_fees_None 6395 19.791409 float64
min_fees_Pending 6395 19.791409 float64
min_fees_Rejected 6395 19.791409 float64
min_fees_Rejected Awaiting Retry 6395 19.791409 float64
min_fees_Returned 6395 19.791409 float64
std_principal_Skipped 6395 19.791409 float64
std_fees_Pending 6395 19.791409 float64
std_fees_None 6395 19.791409 float64
med_principal_None 6395 19.791409 float64
mean_pymtAmt_Returned 6395 19.791409 float64
mean_pymtAmt_Skipped 6395 19.791409 float64
med_fees_Cancelled 6395 19.791409 float64
med_fees_Checked 6395 19.791409 float64
med_fees_Complete 6395 19.791409 float64
med_fees_None 6395 19.791409 float64
med_fees_Pending 6395 19.791409 float64
med_fees_Rejected 6395 19.791409 float64
cnt_pymtRCode_MISSED 6395 19.791409 Int32
med_fees_Returned 6395 19.791409 float64
med_fees_Skipped 6395 19.791409 float64
med_principal_Cancelled 6395 19.791409 float64
med_principal_Checked 6395 19.791409 float64
med_principal_Complete 6395 19.791409 float64
med_principal_Pending 6395 19.791409 float64
std_fees_Checked 6395 19.791409 float64
med_principal_Rejected 6395 19.791409 float64
med_principal_Rejected Awaiting Retry 6395 19.791409 float64
med_principal_Returned 6395 19.791409 float64
med_principal_Skipped 6395 19.791409 float64
med_pymtAmt_Cancelled 6395 19.791409 float64
med_pymtAmt_Checked 6395 19.791409 float64
med_pymtAmt_Complete 6395 19.791409 float64
med_pymtAmt_None 6395 19.791409 float64
med_pymtAmt_Pending 6395 19.791409 float64
med_pymtAmt_Rejected 6395 19.791409 float64
med_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
med_pymtAmt_Returned 6395 19.791409 float64
med_pymtAmt_Skipped 6395 19.791409 float64
std_fees_Cancelled 6395 19.791409 float64
min_fees_Skipped 6395 19.791409 float64
min_principal_Pending 6395 19.791409 float64
min_principal_Checked 6395 19.791409 float64
min_principal_Complete 6395 19.791409 float64
max_principal_Rejected Awaiting Retry 6395 19.791409 float64
max_principal_Returned 6395 19.791409 float64
max_principal_Skipped 6395 19.791409 float64
max_pymtAmt_Cancelled 6395 19.791409 float64
max_pymtAmt_Checked 6395 19.791409 float64
max_pymtAmt_Complete 6395 19.791409 float64
max_pymtAmt_None 6395 19.791409 float64
max_pymtAmt_Pending 6395 19.791409 float64
max_pymtAmt_Rejected 6395 19.791409 float64
max_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
max_pymtAmt_Returned 6395 19.791409 float64
max_pymtAmt_Skipped 6395 19.791409 float64
cnt_custom 6395 19.791409 Int32
cnt_non custom 6395 19.791409 Int32
cnt_pymtStatus_Cancelled 6395 19.791409 Int32
cnt_pymtStatus_Checked 6395 19.791409 Int32
cnt_pymtStatus_Complete 6395 19.791409 Int32
cnt_pymtStatus_None 6395 19.791409 Int32
cnt_pymtStatus_Pending 6395 19.791409 Int32
cnt_pymtStatus_Rejected 6395 19.791409 Int32
cnt_pymtStatus_Rejected Awaiting Retry 6395 19.791409 Int32
cnt_pymtStatus_Returned 6395 19.791409 Int32
cnt_pymtStatus_Skipped 6395 19.791409 Int32
cnt_pymtRCode_C01 6395 19.791409 Int32
cnt_pymtRCode_C02 6395 19.791409 Int32
cnt_pymtRCode_C03 6395 19.791409 Int32
cnt_pymtRCode_C05 6395 19.791409 Int32
max_principal_Rejected 6395 19.791409 float64
max_principal_Pending 6395 19.791409 float64
max_principal_None 6395 19.791409 float64
min_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
min_principal_None 6395 19.791409 float64
mean_pymtAmt_Rejected 6395 19.791409 float64
min_principal_Rejected 6395 19.791409 float64
min_principal_Rejected Awaiting Retry 6395 19.791409 float64
min_principal_Returned 6395 19.791409 float64
min_principal_Skipped 6395 19.791409 float64
min_pymtAmt_Cancelled 6395 19.791409 float64
min_pymtAmt_Checked 6395 19.791409 float64
min_pymtAmt_Complete 6395 19.791409 float64
min_pymtAmt_None 6395 19.791409 float64
min_pymtAmt_Pending 6395 19.791409 float64
min_pymtAmt_Rejected 6395 19.791409 float64
min_pymtAmt_Returned 6395 19.791409 float64
max_principal_Complete 6395 19.791409 float64
min_pymtAmt_Skipped 6395 19.791409 float64
max_fees_Cancelled 6395 19.791409 float64
max_fees_Checked 6395 19.791409 float64
max_fees_Complete 6395 19.791409 float64
max_fees_None 6395 19.791409 float64
max_fees_Pending 6395 19.791409 float64
max_fees_Rejected 6395 19.791409 float64
max_fees_Rejected Awaiting Retry 6395 19.791409 float64
max_fees_Returned 6395 19.791409 float64
max_fees_Skipped 6395 19.791409 float64
max_principal_Cancelled 6395 19.791409 float64
max_principal_Checked 6395 19.791409 float64
mean_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
mean_pymtAmt_Checked 6395 19.791409 float64
mean_pymtAmt_Pending 6395 19.791409 float64
med_days_btw_pymts 6395 19.791409 float64
cnt_pymtRCode_R03 6395 19.791409 Int32
mean_pymtAmt_None 6395 19.791409 float64
cnt_pymtRCode_R01 6395 19.791409 Int32
principal_tot 6395 19.791409 float64
fees_tot 6395 19.791409 float64
paymentAmount_tot 6395 19.791409 float64
sum_days_btw_pymts 6395 19.791409 float64
mean_days_btw_pymts 6395 19.791409 float64
std_days_btw_pymts 6395 19.791409 float64
sum_fees_Rejected Awaiting Retry 6395 19.791409 float64
cnt_days_btw_pymts 6395 19.791409 Int32
min_days_btw_pymts 6395 19.791409 float64
max_days_btw_pymts 6395 19.791409 float64
sum_fees_Cancelled 6395 19.791409 float64
sum_fees_Checked 6395 19.791409 float64
sum_fees_Complete 6395 19.791409 float64
sum_fees_None 6395 19.791409 float64
sum_fees_Pending 6395 19.791409 float64
cnt_pymtRCode_R04 6395 19.791409 Int32
cnt_pymtRCode_R06 6395 19.791409 Int32
cnt_pymtRCode_R07 6395 19.791409 Int32
cnt_pymtRCode_R08 6395 19.791409 Int32
cnt_pymtRCode_RXS 6395 19.791409 Int32
cnt_pymtRCode_RXL 6395 19.791409 Int32
cnt_pymtRCode_RWC 6395 19.791409 Int32
cnt_pymtRCode_RUP 6395 19.791409 Int32
cnt_pymtRCode_RIR 6395 19.791409 Int32
cnt_pymtRCode_RFG 6395 19.791409 Int32
cnt_pymtRCode_RBW 6395 19.791409 Int32
cnt_pymtRCode_RAF 6395 19.791409 Int32
cnt_pymtRCode_R99 6395 19.791409 Int32
cnt_pymtRCode_R29 6395 19.791409 Int32
cnt_pymtRCode_R20 6395 19.791409 Int32
cnt_pymtRCode_R19 6395 19.791409 Int32
cnt_pymtRCode_R16 6395 19.791409 Int32
cnt_pymtRCode_R15 6395 19.791409 Int32
cnt_pymtRCode_R13 6395 19.791409 Int32
cnt_pymtRCode_R10 6395 19.791409 Int32
cnt_pymtRCode_R09 6395 19.791409 Int32
sum_fees_Rejected 6395 19.791409 float64
cnt_pymtRCode_R02 6395 19.791409 Int32
sum_fees_Returned 6395 19.791409 float64
mean_fees_Checked 6395 19.791409 float64
mean_fees_None 6395 19.791409 float64
mean_fees_Pending 6395 19.791409 float64
mean_fees_Rejected 6395 19.791409 float64
mean_fees_Rejected Awaiting Retry 6395 19.791409 float64
mean_fees_Returned 6395 19.791409 float64
mean_fees_Skipped 6395 19.791409 float64
mean_principal_Cancelled 6395 19.791409 float64
mean_principal_Checked 6395 19.791409 float64
mean_principal_Complete 6395 19.791409 float64
mean_principal_None 6395 19.791409 float64
mean_principal_Pending 6395 19.791409 float64
sum_fees_Skipped 6395 19.791409 float64
mean_principal_Rejected 6395 19.791409 float64
mean_principal_Rejected Awaiting Retry 6395 19.791409 float64
mean_principal_Returned 6395 19.791409 float64
mean_principal_Skipped 6395 19.791409 float64
mean_pymtAmt_Cancelled 6395 19.791409 float64
cnt_pymtRCode_LPP01 6395 19.791409 Int32
mean_pymtAmt_Complete 6395 19.791409 float64
mean_fees_Complete 6395 19.791409 float64
cnt_pymtRCode_C07 6395 19.791409 Int32
mean_fees_Cancelled 6395 19.791409 float64
sum_pymtAmt_Checked 6395 19.791409 float64
sum_pymtAmt_Skipped 6395 19.791409 float64
sum_principal_Rejected Awaiting Retry 6395 19.791409 float64
sum_principal_Rejected 6395 19.791409 float64
sum_principal_Pending 6395 19.791409 float64
sum_principal_Returned 6395 19.791409 float64
sum_principal_Complete 6395 19.791409 float64
sum_principal_Skipped 6395 19.791409 float64
sum_pymtAmt_Cancelled 6395 19.791409 float64
sum_principal_None 6395 19.791409 float64
sum_pymtAmt_Complete 6395 19.791409 float64
sum_pymtAmt_Rejected Awaiting Retry 6395 19.791409 float64
sum_pymtAmt_Pending 6395 19.791409 float64
sum_pymtAmt_Rejected 6395 19.791409 float64
sum_principal_Checked 6395 19.791409 float64
sum_pymtAmt_None 6395 19.791409 float64
sum_principal_Cancelled 6395 19.791409 float64
sum_pymtAmt_Returned 6395 19.791409 float64
cfind.driverlicenseformatinvalid 3412 10.559544 boolean
cfindvrfy.phonematchtype 612 1.894033 category
cfind.telephonenumberinconsistentwithstate 449 1.389577 boolean
fpStatus 141 0.436370 category
clearfraudscore 93 0.287819 float64
cfind.inputssninvalid 34 0.105224 boolean
cfind.currentaddressreportedbytradeopenlt90days 34 0.105224 boolean
cfind.ssnreportedmorefrequentlyforanother 34 0.105224 boolean
cfind.onfileaddresshighrisk 34 0.105224 boolean
cfind.inquiryonfilecurrentaddressconflict 34 0.105224 boolean
cfind.inquiryaddressnonresidential 34 0.105224 boolean
cfind.onfileaddresscautious 34 0.105224 boolean
cfind.inquiryageyoungerthanssnissuedate 34 0.105224 boolean
cfind.telephonenumberinconsistentwithaddress 34 0.105224 boolean
cfind.inquiryaddresscautious 34 0.105224 boolean
cfind.inputssnissuedatecannotbeverified 34 0.105224 boolean
cfind.bestonfilessnissuedatecannotbeverified 34 0.105224 boolean
cfind.morethan3inquiriesinthelast30days 34 0.105224 boolean
cfindvrfy.phonematchresult 34 0.105224 category
cfind.creditestablishedpriortossnissuedate 34 0.105224 boolean
cfind.inputssnrecordedasdeceased 34 0.105224 boolean
cfind.inquiryaddresshighrisk 34 0.105224 boolean
cfind.inquirycurrentaddressnotonfile 34 0.105224 boolean
cfind.highprobabilityssnbelongstoanother 34 0.105224 boolean
cfind.bestonfilessnrecordedasdeceased 34 0.105224 boolean
cfind.currentaddressreportedbynewtradeonly 34 0.105224 boolean
cfind.onfileaddressnonresidential 34 0.105224 boolean
cfind.creditestablishedbeforeage18 34 0.105224 boolean
cfindvrfy.overallmatchreasoncode 26 0.080465 category
cfindvrfy.overallmatchresult 26 0.080465 category
cfindvrfy.ssndobmatch 26 0.080465 category
cfindvrfy.nameaddressmatch 26 0.080465 category
cfindvrfy.ssnnamematch 26 0.080465 category
originatedDate 18 0.055707 datetime64[ns]
cfind.maxnumberofssnswithanybankaccount 17 0.052612 Int32
cfind.totalnumberoffraudindicators 17 0.052612 Int32
nPaidOff 2 0.006190 Int32
cfinq.thirtydaysago 1 0.003095 Int32
cfinq.twentyfourhoursago 1 0.003095 Int32
cfinq.oneminuteago 1 0.003095 Int32
cfinq.onehourago 1 0.003095 Int32
cfinq.ninetydaysago 1 0.003095 Int32
cfinq.sevendaysago 1 0.003095 Int32
cfinq.tenminutesago 1 0.003095 Int32
cfinq.fifteendaysago 1 0.003095 Int32
cfinq.threesixtyfivedaysago 1 0.003095 Int32
hasCF 0 0.000000 boolean
fpymtAmt 0 0.000000 float64
fpymtDate 0 0.000000 datetime64[ns]
underwritingid 0 0.000000 object
loanId 0 0.000000 object
anon_ssn 0 0.000000 object
payFrequency 0 0.000000 category
apr 0 0.000000 float64
applicationDate 0 0.000000 datetime64[ns]
originated 0 0.000000 boolean
approved 0 0.000000 boolean
isFunded 0 0.000000 boolean
loanStatus 0 0.000000 category
loanAmount 0 0.000000 float64
originallyScheduledPaymentAmount 0 0.000000 float64
state 0 0.000000 category
leadType 0 0.000000 category
leadCost 0 0.000000 float64
clarityFraudId 0 0.000000 object
fpymtStatus 0 0.000000 category
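For reference, a missing-values summary like the table above can be assembled with a few pandas calls. This is a minimal sketch on a toy frame (`df` and its columns are illustrative stand-ins for the merged loan-level DataFrame):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the merged loan-level DataFrame
df = pd.DataFrame({"a": [1.0, np.nan, 3.0], "b": ["x", None, None]})

# Missing-value count, proportion and dtype per column, most-missing first
missing = (pd.DataFrame({"Missing Values (n)": df.isna().sum(),
                         "Proportion (%)": df.isna().mean() * 100,
                         "Dtype": df.dtypes})
           .sort_values("Missing Values (n)", ascending=False))
print(missing)
```

Sorting by the count column puts the sparsest features first, which is the ordering shown above.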
Verify first payment status¶
- Check loan_df.fpStatus vs. fpymtStatus derived from the payment data
# Compare fpStatus and fpymtStatus
# Convert fpStatus and fpymtStatus to strings so missing values appear as an explicit "nan" category in the crosstab
pd.crosstab(index = clean_df["fpStatus"].astype(str).fillna("NaN"), # Convert fpStatus to string and replace NaN with "NaN"
columns = clean_df["fpymtStatus"].astype(str).fillna("NaN"),
dropna = False, # Include all NaN values in the table
margins = True, # Include row and column totals
margins_name = "Total").fillna(0).astype(int) # Fill any remaining NaNs with 0 and ensure integers
# Check affected rows:
# Convert both fpStatus and fpymtStatus to strings before comparison
filtered_df = clean_df[clean_df["fpStatus"].astype(str) != clean_df["fpymtStatus"].astype(str)]
filtered_df[["loanId", "originated", "approved", "leadCost", "isFunded", "fpStatus", "fpymtStatus", "fpymtAmt", "loanStatus"]]
#filtered_df[["loanId", "originated", "approved", "leadCost", "isFunded", "fpStatus", "fpymtStatus", "fpymtAmt", "loanStatus"]].to_csv("filtered_df.csv", index = False) # Set index=True to include the index
del filtered_df
| fpymtStatus | Cancelled | Checked | None | Pending | Rejected | Skipped | Total |
|---|---|---|---|---|---|---|---|
| fpStatus | |||||||
| Cancelled | 162 | 1 | 8 | 0 | 0 | 0 | 171 |
| Checked | 0 | 24767 | 1208 | 1074 | 0 | 0 | 27049 |
| Pending | 0 | 0 | 3 | 0 | 0 | 0 | 3 |
| Rejected | 0 | 10 | 362 | 163 | 4292 | 0 | 4827 |
| Skipped | 0 | 0 | 0 | 0 | 0 | 121 | 121 |
| nan | 36 | 45 | 58 | 1 | 0 | 1 | 141 |
| Total | 198 | 24823 | 1639 | 1238 | 4292 | 122 | 32312 |
| loanId | originated | approved | leadCost | isFunded | fpStatus | fpymtStatus | fpymtAmt | loanStatus | |
|---|---|---|---|---|---|---|---|---|---|
| 35 | LL-I-00240780 | True | True | 60.0 | True | Cancelled | Checked | 93.08 | Settlement Paid Off |
| 164 | LL-I-00847881 | True | True | 3.0 | False | NaN | Cancelled | 60.35 | Credit Return Void |
| 194 | LL-I-00904993 | True | True | 3.0 | False | NaN | Cancelled | 51.65 | Credit Return Void |
| 619 | LL-I-01635854 | True | True | 25.0 | False | NaN | Cancelled | 78.47 | Credit Return Void |
| 674 | LL-I-01638507 | True | True | 0.0 | False | NaN | Checked | 300.00 | Customer Voided New Loan |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32305 | LL-I-18602768 | True | True | 0.0 | True | Checked | None | 124.09 | New Loan |
| 32306 | LL-I-18611386 | True | True | 0.0 | True | Rejected | None | 60.63 | Internal Collection |
| 32307 | LL-I-18625392 | True | True | 0.0 | True | Checked | None | 266.30 | Paid Off Loan |
| 32308 | LL-I-18629093 | True | True | 0.0 | True | Checked | None | 159.04 | Paid Off Loan |
| 32309 | LL-T-01984747 | True | True | 6.0 | True | NaN | None | 12.89 | External Collection |
2970 rows × 9 columns
Inconsistent first payment statuses are indicated by any values outside the diagonal of the table.
fpymtStatus is derived based on the first paymentAmount > 0, whereas fpStatus comes from the provided loan data. However, the method used to derive fpStatus isn't documented. I'm unsure how fpStatus was determined or whether it accurately reflects the true first payment status at the time the data was handed over.
As with the last payment status, despite these inconsistencies, I'll proceed with using fpStatus from the loan data until a subject matter expert reviews the issue and provides further clarification.
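As described above, `fpymtStatus` comes from the first payment with a positive amount. A minimal sketch of that derivation on toy data follows; the payment column names (`loanId`, `paymentDate`, `paymentAmount`, `paymentStatus`) are assumptions inferred from the notebook's inputs and may differ from the actual payment.csv schema:

```python
import pandas as pd

# Toy payment data standing in for payment.csv (column names are assumed)
payment_df = pd.DataFrame({
    "loanId": ["A", "A", "B", "B"],
    "paymentDate": pd.to_datetime(["2015-01-05", "2015-01-01",
                                   "2015-02-01", "2015-02-10"]),
    "paymentAmount": [50.0, 0.0, 75.0, 80.0],
    "paymentStatus": ["Checked", "Cancelled", "Rejected", "Checked"],
})

# First payment per loan with paymentAmount > 0, ordered by payment date
first_pymt = (payment_df[payment_df["paymentAmount"] > 0]
              .sort_values("paymentDate")
              .groupby("loanId", as_index=False)
              .first()
              .rename(columns={"paymentStatus": "fpymtStatus",
                               "paymentAmount": "fpymtAmt",
                               "paymentDate": "fpymtDate"}))
```

Zero-amount rows are filtered out before taking the first row per loan, so a cancelled zero-dollar entry never determines the first-payment status.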
Target derivation¶
Based on loanStatus values from the matched data.
According to the provided data dictionary, it's the current loan status. Most values are self-explanatory; the less obvious ones are:
(i) Returned Item: missed 1 payment (but not more), due to insufficient funds
(ii) Rejected: Rejected by automated underwriting rules – not by human underwriters
(iii) Withdrawn Application – application abandoned for more than 2 weeks, or is withdrawn by a human underwriter or customer
(iv) Statuses with the word "void" in them mean a loan that is approved but cancelled. (One reason is the loan failed to be debited into the customer’s account).
clean_df.loanStatus.value_counts(dropna = False)
loanStatus
External Collection         9335
Paid Off Loan               9086
New Loan                    6529
Internal Collection         5134
Returned Item               1051
Settlement Paid Off          536
Settled Bankruptcy           283
Pending Paid Off             112
Charged Off Paid Off         109
Credit Return Void            70
Customer Voided New Loan      47
CSR Voided New Loan           16
Withdrawn Application          3
Charged Off                    1
Name: count, dtype: int64
Based on the frequency table above, I derive the binary target as follows:
- Safe loans 👇
- Paid Off Loan: Fully repaid loan without issues.
- New Loan: A newly initiated loan, still in good standing.
- Pending Paid Off: Loan nearing or in the process of being fully paid.
- Settlement Paid Off: Loan paid off through a settlement agreement.
- Credit Return Void: Reversal or correction of a loan-related return.
- Customer Voided New Loan: Loan application or agreement canceled by the customer (see notes above).
- CSR Voided New Loan: Loan voided by a customer service representative (see notes above).
- Withdrawn Application: Application abandoned for more than two weeks, or withdrawn by a human underwriter or the customer (see notes above).
- Risky loans 👇
- External Collection: Loan transferred to a collection agency due to non-payment.
- Internal Collection: Loan in default, handled by the lender's internal collection team.
- Returned Item: Missed one payment (but not more) due to insufficient funds (see notes above).
- Settled Bankruptcy: Loan resolved through a bankruptcy process.
- Charged Off Paid Off: A previously charged-off loan that was later paid off.
- Charged Off: Loan written off as a loss by the lender.
Rationale:
- Safe loans are those that are fully repaid or properly closed without causing any financial loss to the lender.
- Risky loans indicate that the borrower is in financial trouble: missing payments, having the loan charged off, or being sent to collections because they couldn't keep up with what they owed.

Grouping loans this way lets the lender separate safe loans (fully paid off, new and in good standing, or settled through an agreement) from risky ones (in collections, tied up in bankruptcy, or written off). This makes it easier to gauge the overall risk in the loan book and plan how to deal with problem loans.
# 0 = Safe loans
# 1 = Risky loans
loanStatus_mapping = {# Safe loans
"Paid Off Loan": 0,
"New Loan": 0,
"Pending Paid Off": 0,
"Settlement Paid Off": 0,
"Credit Return Void": 0,
"Customer Voided New Loan": 0,
"CSR Voided New Loan": 0,
"Withdrawn Application": 0,
# Risky loans
"External Collection": 1,
"Internal Collection": 1,
"Returned Item": 1,
"Settled Bankruptcy": 1,
"Charged Off Paid Off": 1,
"Charged Off": 1}
clean_df["target"] = clean_df["loanStatus"].map(loanStatus_mapping).astype("Int8")
del loanStatus_mapping
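A quick sanity check can confirm that the mapping covers every observed loanStatus, so the derived target contains no unmapped entries. This is a minimal sketch on toy data (`clean_df` here is a small stand-in with a subset of statuses):

```python
import pandas as pd

# Toy frame mirroring the derivation above (subset of statuses for illustration)
clean_df = pd.DataFrame({"loanStatus": ["Paid Off Loan", "External Collection",
                                        "Charged Off"]})
loanStatus_mapping = {"Paid Off Loan": 0, "External Collection": 1, "Charged Off": 1}
clean_df["target"] = clean_df["loanStatus"].map(loanStatus_mapping).astype("Int8")

# Any status missing from the mapping would surface here as <NA> in target
unmapped = clean_df.loc[clean_df["target"].isna(), "loanStatus"].unique()
assert len(unmapped) == 0, f"Unmapped loanStatus values: {unmapped}"
```

Because `Series.map` returns missing values for keys absent from the dictionary, this check fails loudly if a new loanStatus value ever appears in the data.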
Data visualization¶
Correlation¶
- nominal-nominal (categorical-categorical) association: Cramér's V
- numerical-numerical association: Spearman's rank correlation
- nominal-numerical association: correlation ratio (η)
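For intuition, Cramér's V can be computed by hand from a contingency table: it rescales the chi-squared statistic to the [0, 1] range. The sketch below has no bias correction, whereas dython's implementation applies one by default, so its values may differ slightly:

```python
import numpy as np
import pandas as pd

def cramers_v(x: pd.Series, y: pd.Series) -> float:
    """Cramér's V between two categorical series (no bias correction)."""
    table = pd.crosstab(x, y).to_numpy()
    n = table.sum()
    # Expected counts under independence, then the chi-squared statistic
    expected = np.outer(table.sum(axis=1), table.sum(axis=0)) / n
    chi2 = ((table - expected) ** 2 / expected).sum()
    r, c = table.shape
    return float(np.sqrt(chi2 / (n * (min(r, c) - 1))))

# Perfectly associated variables give V = 1.0
x = pd.Series(["a"] * 10 + ["b"] * 10)
y = pd.Series(["x"] * 10 + ["y"] * 10)
print(cramers_v(x, y))  # 1.0
```

Independent variables give V = 0, and V is symmetric in its arguments, unlike the correlation ratio used for nominal-numerical pairs.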
# Convert columns with Dtype "bool" or nullable "boolean" to nullable integer Dtype because dython doesn't handle boolean implicitly
df = clean_df.apply(lambda col: col.astype("Int32") if col.dtypes in ["bool", "boolean"] else col)
fig, ax = plt.subplots(figsize = (20, 20), dpi = 300)
# https://shakedzy.xyz/dython/modules/nominal/
r = associations(df.drop(columns = ["underwritingid", "loanId", "anon_ssn", "clarityFraudId",
"applicationDate", "originatedDate",
"fpymtDate", "fpymtAmt", "fpymtStatus",
"principal_tot", "fees_tot", "paymentAmount_tot"]),
nominal_columns = "auto",
numerical_columns = "auto",
nom_nom_assoc = "cramer",
num_num_assoc = "spearman",
nom_num_assoc = "correlation_ratio",
mark_columns = True,
ax = ax,
plot = False,
clustering = True, # Computed associations are sorted into groups by similar correlations
#filename = "correlation heatmap.png", # Very poor resolution due to large number of features
multiprocessing = True,
max_cpu_cores = 8)
del df
# Correlation matrix in 4 decimal places
corr_matrix_full = r["corr"].round(4)
# Mask off-diagonal values
mask = np.triu(np.ones_like(corr_matrix_full, dtype = bool))
corr_matrix_masked = corr_matrix_full.mask(mask)
corr_matrix_masked.to_csv(f'{temp_dir}/correlation.csv')
del r, mask
# Plot correlation heatmap using Plotly
fig = px.imshow(corr_matrix_masked, color_continuous_scale = "RdBu_r", zmin = -1, zmax = 1)
# Adjust figure
fig.update_layout(width = max(800, int(corr_matrix_masked.shape[0]) * 25), # Dynamically scale heatmap width
height = max(800, int(corr_matrix_masked.shape[0]) * 25), # Dynamically scale heatmap height
title = "Correlation Heatmap",
xaxis = dict(tickangle = 270, # Rotate x-axis labels for better visibility
tickmode = "linear", # Ensure all ticks are shown
automargin = True # Ensure proper margin adjustment
),
yaxis = dict(tickmode = "linear",
automargin = True
),
margin = dict(l = 50, r = 50, b = 50, t = 100) # Add margins to avoid label clipping
)
# Save heatmap as HTML in the same directory
fig.write_html(f'{temp_dir}/correlation_heatmap.html')
# Render figure in default web browser to accommodate memory-intensive plot and ensure better compatibility and larger viewing area
fig.show(renderer = "browser");
del fig;
# Extract all correlation coefficients between features and target
target_corr = corr_matrix_full["target (con)"].dropna()
# Drop the target's self-correlation (always 1.0)
target_corr = target_corr.drop("target (con)", errors = "ignore")
# Select the top 21 features most strongly associated with target
top_21_index = (target_corr.abs()
.sort_values(ascending = False)
.head(21)
.index
)
# Retrieve signed correlation values for these features
top_20_signed = target_corr[top_21_index]
# Separate into positive and negative correlation groups
pos_corr = top_20_signed[top_20_signed >= 0].sort_values(ascending = False)
neg_corr = top_20_signed[top_20_signed < 0].sort_values(ascending = True)
# Combine them to get the desired order (top to bottom)
final_sorted_series = pd.concat([pos_corr, neg_corr])
# Reverse the series for Plotly's bottom-to-top plotting behavior
plot_series = final_sorted_series.iloc[::-1]
# Assign colors based on the correctly ordered data
bar_colors = ["red" if x < 0 else "blue" for x in plot_series.values]
# Build the interactive horizontal bar chart with Plotly
fig = px.bar(x = plot_series.values,
y = plot_series.index,
orientation = "h",
#color = bar_colors,
labels = {"x": "Correlation coefficient", "y": "Feature"},
title = "<b>Top 20 Features Associated with target</b>"
)
# Apply bar colors directly
fig.update_traces(marker_color = bar_colors)
# Add a vertical reference line at x = 0
fig.add_vline(x = 0, line_width = 1, line_dash = "dash", line_color = "black")
# Finalize appearance: hide color legend, center title, adjust height
fig.update_layout(showlegend = False, title_x = 0.5, height = 500)
fig.show();
del corr_matrix_masked, corr_matrix_full, target_corr, top_20_signed, pos_corr, neg_corr, bar_colors, fig;
The bar chart highlights the top 20 features most strongly associated with the target variable. As expected, variables derived directly from loan status and various dimensions of rejected payments dominate the positive correlations, while successfully processed (checked) payments show negative correlations, acting as protective factors.
- loanStatus (nom) is perfectly correlated with the target because it was used to derive it, so it won't be included in any modelling to avoid data leakage.
- Payment rejection patterns are strong risk indicators:
  - Count-based: cnt_pymtStatus_Rejected - frequency of rejections matters
  - Amount-based: sum_pymtAmt_Rejected, max_pymtAmt_Rejected, mean_pymtAmt_Rejected, med_pymtAmt_Rejected, min_pymtAmt_Rejected - both the size and central tendency of rejected payments are predictive
  - Fees-based: sum_fees_Rejected, mean_fees_Rejected, max_fees_Rejected, med_fees_Rejected, min_fees_Rejected - fees tied to rejections consistently appear important
  - Principal-based: sum_principal_Rejected, max_principal_Rejected, mean_principal_Rejected, med_principal_Rejected, min_principal_Rejected, std_principal_Rejected - rejected principal amounts and their distribution are strong signals of risk
- Checked (successful) payments are protective factors:
  - The negative correlations reveal that sum_principal_Checked, sum_pymtAmt_Checked and max_principal_Checked all correlate negatively with the target
  - Higher amounts of successfully processed payments indicate lower risk
Summary
- High-risk loans: Many rejected payments across principal, amounts and fees, with consistent patterns across totals, averages and variability.
- Low-risk loans: Large, successfully processed payments with few rejections.
Number of anon_ssn and Average loanAmount by Application Count¶
Here, I assume anon_ssn represents a unique person or loan applicant.
# Aggregate loan applications and amounts at the anon_ssn level, then compute their averages by application count
summary_df = clean_df.groupby("anon_ssn").agg(total_applications = ("anon_ssn", "size"), # Count total applications at the anon_ssn level
sum_loanAmount = ("loanAmount", "sum") # Sum of all loan amounts at the anon_ssn level
).groupby("total_applications").agg(num_anon_ssn = ("total_applications", "count"), # Number of anon_ssn in each application count category
avg_loanAmount = ("sum_loanAmount", "mean")) # Average loan amount for each application count category
# Ensure no gaps in the total_applications sequence
summary_df = summary_df.reindex(range(summary_df.index.min(), summary_df.index.max() + 1), fill_value = 0).reset_index()
fig = px.bar(summary_df, x = "total_applications", y = "num_anon_ssn",
text = "num_anon_ssn", title = "Number of anon_ssn and Average Loan Amount by Application Count")
fig.add_scatter(x = summary_df["total_applications"], y = summary_df["avg_loanAmount"],
mode = "lines+markers", name = "Average Loan Amount", yaxis = "y2", line = dict(color = "brown"))
fig.update_layout(width = 1400,
height = 600,
title = {"x": 0.5, "font": {"size": 18, "weight": "bold"}},
xaxis = dict(title = "Total Loan Applications", dtick = 1),
yaxis = dict(title = "Number of anon_ssn", dtick = 2000, tickformat = ",d"), # Ensure full number with thousands separators instead of "k"
yaxis2 = dict(title = "Average Loan Amount<br>(USD)", overlaying = "y", side = "right"),
legend = dict(x = 0.75, y = 0.95)
)
fig.show();
tbl = (summary_df.rename(columns = {"total_applications": "Number of Loan Applications"})
.set_index("Number of Loan Applications")
.T.rename(index = {"num_anon_ssn": "Number of anon_ssn", "avg_loanAmount": "Average Loan Amount (USD)"})
)
# Format the "Number of anon_ssn" row as integers and the average-amount row to two decimals
display(tbl.style
.format(formatter = "{: .0f}", subset = pd.IndexSlice["Number of anon_ssn", :])
.format(formatter = "{: .2f}", subset=pd.IndexSlice["Average Loan Amount (USD)", :]))
del summary_df, fig, tbl;
| Number of Loan Applications | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 |
|---|---|---|---|---|---|---|---|---|---|---|
| Number of anon_ssn | 28030 | 1740 | 186 | 41 | 9 | 3 | 1 | 0 | 0 | 1 |
| Average Loan Amount (USD) | 653.30 | 1336.45 | 2052.17 | 3010.71 | 3721.44 | 7025.00 | 2800.00 | 0.00 | 0.00 | 8200.00 |
The overall trend shows a steady decrease in the number of applicants as the number of applications rises. Most people applied for a loan just once, with 28,030 individuals in this group. As the number of applications per person (anon_ssn) increases, the number of people in that group drops quickly. This means that only a small number of people apply for multiple loans.
The average loan amount doesn't follow a straight pattern. It generally increases as people apply for more loans, peaking at 6 applications among typical applicants, meaning those with exactly 6 applications received the highest loan amounts on average. However, there is a sharp spike for the one person who applied 10 times, whose loans were significantly larger than everyone else's, including the person with 7 applications.
In summary, most applicants applied for a loan only once, and the number of applicants decreases as the number of applications increases. However, the average loan amount varies, with notable peaks at 6 and 10 applications. The sudden spike at 10 applications could mean something unique about that applicant or how loans were given to them.
Loan stages distribution¶
originated->approved->isFunded
# Contingency table
contingency_tbl = clean_df.groupby(["originated", "approved", "isFunded"]).size().unstack(fill_value = 0)
# Heatmap
plt.figure(figsize = (8, 3))
sns.heatmap(contingency_tbl, annot = True, fmt = "d", cmap = "Blues", linewidths = 0.5)
plt.title("Application Counts by Origination, Approval, and Funding Status")
plt.xlabel("isFunded")
plt.ylabel("(Originated - Approved)")
plt.show();
del contingency_tbl;
The heatmap shows that loans which get both approved and originated are very likely to be funded. Only 118 approved and originated loans didn't get funded. Additionally, only 18 loans were neither approved nor originated, meaning most applications in the matched dataset were at least considered. On the other hand, if a loan doesn't make it past the origination and approval stages, it almost never gets funded.
Overall, the matched dataset suggests that getting a loan funded strongly depends on approval and origination. If a loan clears both stages, it's almost always funded. This is consistent with the fact that the dataset contains repayment data only for loans that were actually funded.
Target distribution¶
- Refer Target derivation for grouping description
# Calculate frequency counts and proportions for both columns
target_cnts = clean_df["target"].value_counts()
target_prop = clean_df["target"].value_counts(normalize = True)
# Group by target and category to get counts and proportions
cat_cnts = clean_df.groupby(["target", "loanStatus"], observed = False).size().reset_index(name = "counts")
cat_cnts["proportion"] = cat_cnts["counts"] / len(clean_df)
# Create labels for the sunburst chart
cat_cnts["cat_labels"] = cat_cnts["loanStatus"].astype(str)+" (n = " + cat_cnts["counts"].astype(str) + ", " + (cat_cnts["proportion"] * 100).round(2).astype(str) + "%)"
cat_cnts["target_labels"] = cat_cnts["target"].map({0: "Safe", 1: "Risky"}) + "<br>"+" (n = " + cat_cnts["target"].map(target_cnts).astype(str) + ", "+ cat_cnts["target"].map(target_prop).apply(lambda x: f'{x: .2%}') + ")"
# Sunburst chart
fig = px.sunburst(cat_cnts,
path = ["target_labels", "cat_labels"], # Define the hierarchy of the categories
values = "counts" # Define the size of the segments
)
fig.update_layout(title = dict(text = "<b>Sunburst plot of 'target' and 'loanStatus' with Counts and Proportions</b>",
x = 0.5,
y = 0.98,
xanchor = "center",
yanchor = "top"),
margin = dict(t = 50, l = 50, r = 50, b = 50),
width = 900,
height = 900,
uniformtext_mode = "show" # Ensure all text is shown
)
fig.show()
risky_cols = ["External Collection", "Internal Collection", "Returned Item", "Settled Bankruptcy", "Charged Off Paid Off", "Charged Off"]
safe_cols = ["Paid Off Loan", "New Loan", "Pending Paid Off", "Settlement Paid Off",
"Credit Return Void", "Customer Voided New Loan", "CSR Voided New Loan",
"Withdrawn Application"]
tbl = ((pd.DataFrame({"n": clean_df["loanStatus"].value_counts(dropna = False),
"Proportion (%)": clean_df["loanStatus"].value_counts(dropna = False, normalize = True).mul(100).round(3)
}
)
).T
)
display(tbl.style.format(formatter = "{: .0f}", subset = pd.IndexSlice["n", :])
.set_properties(subset = pd.IndexSlice[:, risky_cols], **{"background-color": "#ffb3b3"})
.set_properties(subset = pd.IndexSlice[:, safe_cols], **{"background-color": "#e6ccff"})
)
del target_cnts, target_prop, cat_cnts, fig, tbl, risky_cols, safe_cols;
| loanStatus | External Collection | Paid Off Loan | New Loan | Internal Collection | Returned Item | Settlement Paid Off | Settled Bankruptcy | Pending Paid Off | Charged Off Paid Off | Credit Return Void | Customer Voided New Loan | CSR Voided New Loan | Withdrawn Application | Charged Off |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| n | 9335 | 9086 | 6529 | 5134 | 1051 | 536 | 283 | 112 | 109 | 70 | 47 | 16 | 3 | 1 |
| Proportion (%) | 28.890000 | 28.120000 | 20.206000 | 15.889000 | 3.253000 | 1.659000 | 0.876000 | 0.347000 | 0.337000 | 0.217000 | 0.145000 | 0.050000 | 0.009000 | 0.003000 |
The sunburst plot reveals that the target variable is approximately evenly distributed, with each class comprising $\approx 50\%$ of the data across the three datasets (loan, underwriting and payment) with matching IDs, i.e. underwritingid, clarityFraudId and loanId. This balanced distribution suggests that class imbalance isn't a concern in this scenario.
Looking at the current status, most of the safe loans consist of fully paid-off loans (9086 or 28.12%), followed by new loans (6529 or 20.21%). This is a healthy sign, as these categories are typically favorable from a business perspective.
On the other hand, a significant portion of risky loans has already been sent to collections, both external (9335 or 28.89%) and internal (5134 or 15.89%). Trying to collect these loans usually costs extra money, which lowers the profit lenders can make from them.
This analysis shows both the successes with repaid loans and the challenges with unpaid or risky ones.
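The Safe/Risky split above is derived from loanStatus. A minimal sketch of that mapping, assuming the category lists mirror the risky_cols/safe_cols used for the styled table (the helper name to_target is illustrative, not from the notebook):

```python
import pandas as pd

# Illustrative mapping from loanStatus categories to the binary target.
# The sets mirror risky_cols / safe_cols defined for the styled table above.
RISKY = {"External Collection", "Internal Collection", "Returned Item",
         "Settled Bankruptcy", "Charged Off Paid Off", "Charged Off"}
SAFE = {"Paid Off Loan", "New Loan", "Pending Paid Off", "Settlement Paid Off",
        "Credit Return Void", "Customer Voided New Loan", "CSR Voided New Loan",
        "Withdrawn Application"}

def to_target(status: pd.Series) -> pd.Series:
    """Map loanStatus to 1 (Risky) / 0 (Safe); unknown values become NA."""
    return status.map(lambda s: 1 if s in RISKY else (0 if s in SAFE else pd.NA))

demo = pd.Series(["Paid Off Loan", "External Collection", "New Loan"])
print(to_target(demo).tolist())  # [0, 1, 0]
```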
loanAmount and number of loan applications over time¶
# Seasonal pattern
# Extract year and month for grouping
clean_df["yr_mth"] = clean_df["applicationDate"].dt.to_period("M")
# Extract month component
clean_df["mth"] = clean_df["yr_mth"].dt.month
# Group by yr_mth and calculate loan application count and sum of loanAmount
mthly_df = clean_df.groupby("yr_mth").agg(application_cnt = ("applicationDate", "size"),
loanAmount = ("loanAmount", "sum")).reset_index()
# Ensure yr_mth is in correct format
mthly_df["yr_mth"] = mthly_df["yr_mth"].astype(str)
# Ensure application_cnt and loanAmount are numeric
mthly_df["application_cnt"] = pd.to_numeric(mthly_df["application_cnt"], errors = "coerce")
mthly_df["loanAmount"] = pd.to_numeric(mthly_df["loanAmount"], errors = "coerce")
mthly_df["scaled_amount"] = mthly_df["loanAmount"] / 1000000
# Create the plot for submission count and loanAmount using go.Scatter
fig = go.Figure()
# Add the submission count trace
fig.add_trace(go.Scatter(x = mthly_df["yr_mth"],
y = mthly_df["application_cnt"],
mode = "lines+markers",
name = "Application Count",
line = dict(color = "blue")))
# Add the loanAmount trace with a secondary y-axis
fig.add_trace(go.Scatter(x = mthly_df["yr_mth"],
y = mthly_df["scaled_amount"],
mode = "lines+markers",
name = "Loan Amount",
line = dict(color = "orange"),
yaxis = "y2"))
# Update layout for dual y-axes and legend
fig.update_layout(title = {"text": "Total Monthly Loan Application and Loan Amount",
"x": 0.45,
"xanchor": "center",
"yanchor": "top",
"font":{"size": 24,
"family": "Arial Black",
"color": "black"
}
},
xaxis = dict(tickmode = "linear", tickformat = "%Y-%m", dtick = "M1"), # Monthly ticks
yaxis = dict(tickmode = "linear", tick0 = 0, dtick = 500, title = "Total number of loan applications"), # 500-interval ticks
yaxis2 = dict(title = "Total Loan Amount (USD) <br> (in 1,000,000)",
overlaying = "y", # Overlay the secondary y-axis on top of the primary y-axis
side = "right", # Place the secondary y-axis on the right
tickformat = ".2f", # 2 decimal places
),
legend = dict(title = "Loan",
x = 1.1, # Position the legend outside the plot area to the right
y = 1, # Align the legend at the top
#bordercolor = "black", # Add a border
#borderwidth = 1 # Set the border width
),
width = 1300,
height = 800
)
fig.show();
tbl = (mthly_df.melt(id_vars = ["yr_mth"], value_vars = ["application_cnt", "loanAmount"])
.replace({"application_cnt": "Number of Loan Applications", "loanAmount": "Loan Amount (USD)"})
.pivot(index = "variable", columns = "yr_mth", values = "value")
.rename_axis(columns = "YYYY-MM")
.reindex(["Number of Loan Applications", "Loan Amount (USD)"]) # Reorder row order
)
display(tbl.style
.format(formatter = "{: .0f}", subset = pd.IndexSlice["Number of Loan Applications", :])
.format(formatter = "{: .2f}", subset = pd.IndexSlice["Loan Amount (USD)", :]))
del mthly_df, fig, tbl;
| YYYY-MM | 2014-12 | 2015-01 | 2015-02 | 2015-03 | 2015-04 | 2015-05 | 2015-06 | 2015-07 | 2015-08 | 2015-09 | 2015-10 | 2015-11 | 2015-12 | 2016-01 | 2016-02 | 2016-03 | 2016-04 | 2016-05 | 2016-06 | 2016-07 | 2016-08 | 2016-09 | 2016-10 | 2016-11 | 2016-12 | 2017-01 | 2017-02 | 2017-03 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| variable | ||||||||||||||||||||||||||||
| Number of Loan Applications | 1 | 4 | 53 | 31 | 107 | 252 | 215 | 342 | 619 | 429 | 619 | 1353 | 1798 | 1405 | 804 | 1066 | 1362 | 879 | 1431 | 1161 | 1024 | 687 | 345 | 1386 | 5315 | 3999 | 3048 | 2577 |
| Loan Amount (USD) | 1000.00 | 2500.00 | 38200.00 | 19575.00 | 55625.00 | 115550.00 | 102375.00 | 167175.00 | 347900.00 | 248286.00 | 340965.00 | 729811.00 | 966477.00 | 923372.00 | 475645.00 | 664882.00 | 1036150.00 | 568703.00 | 1068704.00 | 842600.00 | 781483.00 | 561341.00 | 227926.00 | 738623.00 | 3376319.50 | 2493789.00 | 2326823.00 | 1986206.00 |
The line graph above shows how the number of loan applications and the total amount of money borrowed changed over time. Both increased between December 2014 and March 2017, but there were times when they suddenly went up or down, showing periods when people were taking out more or fewer loans.
From December 2014 to August 2015, both the number of loan applications and the total amount borrowed stayed low and steady. However, between September 2015 and October 2016, both started to rise, with some ups and downs along the way. The biggest jumps happened in December 2015, April 2016 and June 2016.
The highest point in the graph is in December 2016, when both the number of loans and the total money borrowed reached their peak. This could be because of a special event or a time of year, like the holiday season, when more people needed money. After that, both numbers dropped quickly, which might mean that fewer people needed loans or that banks changed their lending rules.
Overall, the graph demonstrates a strong connection between the number of applications and the total loan amount. When applications increase, loan amounts tend to rise, and when applications decrease, loan amounts fall accordingly.
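This co-movement can be quantified directly. A minimal sketch, computing the Pearson correlation on four monthly aggregates taken from the table above (toy subset, not the full mthly_df):

```python
import pandas as pd

# Four monthly aggregates in the same shape as mthly_df above
# (values copied from the 2015-01 .. 2015-04 columns of the table)
mthly = pd.DataFrame({
    "yr_mth": ["2015-01", "2015-02", "2015-03", "2015-04"],
    "application_cnt": [4, 53, 31, 107],
    "loanAmount": [2500.0, 38200.0, 19575.0, 55625.0],
})

# Pearson correlation quantifies the "applications up, amounts up" pattern
r = mthly["application_cnt"].corr(mthly["loanAmount"])
print(round(r, 3))
```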
loanAmount and number of loan applications by month¶
mthly_df = clean_df.groupby("mth").agg(application_cnt = ("applicationDate", "size"),
loanAmount = ("loanAmount", "sum")).reset_index()
# Ensure application_cnt and loanAmount are numeric
mthly_df["application_cnt"] = pd.to_numeric(mthly_df["application_cnt"], errors = "coerce")
mthly_df["loanAmount"] = pd.to_numeric(mthly_df["loanAmount"], errors = "coerce")
mthly_df = mthly_df.sort_values(by = "application_cnt", ascending = False)
# Map month labels using the calendar module
mthly_df["mth_label"] = mthly_df["mth"].map(lambda x: calendar.month_name[x])
# Add a new column with sequential values starting from 1
mthly_df["Rank"] = range(1, len(mthly_df) + 1)
# Parallel Coordinates Plot
# https://plotly.com/python-api-reference/generated/plotly.graph_objects.Parcoords.html
# Reverse the minimum and maximum values for the Rank, so that the month with top rank comes on the top
dims = list([dict(range = (mthly_df["Rank"].max(),
mthly_df["Rank"].min()),
tickvals = mthly_df["Rank"],
ticktext = mthly_df["mth_label"],
label = "Month",
values = mthly_df["Rank"]),
dict(range = (mthly_df["application_cnt"].min(),
mthly_df["application_cnt"].max()),
label = "Number of loan applications",
values = mthly_df['application_cnt']),
dict(range = (mthly_df['loanAmount'].min(),
mthly_df["loanAmount"].max()),
label = "Loan Amount (USD)", values = mthly_df["loanAmount"]),
])
fig = go.Figure(data = go.Parcoords(line = dict(color = mthly_df["Rank"], colorscale = "picnic"), dimensions = dims))
fig = fig.update_layout(width = 1200, height = 800, margin = dict(l = 150, r = 60, t = 60, b = 40), font = dict(size = 15))
fig.show()
tbl = (mthly_df.melt(id_vars = ["Rank"], value_vars = ["mth_label", "application_cnt", "loanAmount"])
.replace({"mth_label": "Month", "application_cnt": "Number of Loan Applications", "loanAmount": "Loan Amount (USD)"})
.pivot(index = "variable", columns = "Rank", values = "value")
.reindex(["Month", "Number of Loan Applications", "Loan Amount (USD)"]) # Reorder row order
)
display(tbl.style
.format(formatter = "{: .0f}", subset = pd.IndexSlice["Number of Loan Applications", :])
.format(formatter = "{: .2f}", subset = pd.IndexSlice["Loan Amount (USD)", :]))
del mthly_df, dims, fig, tbl;
| Rank | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| variable | ||||||||||||
| Month | December | January | February | March | November | June | August | July | April | May | September | October |
| Number of Loan Applications | 7114 | 5408 | 3905 | 3674 | 2739 | 1646 | 1643 | 1503 | 1469 | 1131 | 1116 | 964 |
| Loan Amount (USD) | 4343796.50 | 3419661.00 | 2840668.00 | 2670663.00 | 1468434.00 | 1171079.00 | 1129383.00 | 1009775.00 | 1091775.00 | 684253.00 | 809627.00 | 568891.00 |
The parallel coordinates plot above shows how the number of loan applications and the total loan amount change across different months of the year.
One clear trend is that December has the highest number of applications and the largest total loan amount. This suggests that borrowing activity tends to peak at the end of the year, possibly for holiday spending, travel or end of year business needs. On the other hand, October sees the lowest loan activity, with the fewest applications and the smallest loan amount. This may reflect a period of stability where less borrowing takes place and earlier loans are being repaid.
The pattern in the chart also shows that when applications increase, the total loan amount rises as well, and when applications drop, the total loan amount decreases. This suggests that borrowing activity is driven by the volume of loans rather than being driven mainly by a few unusually large ones.
Looking at the months in more detail, borrowing remains fairly steady in the first half of the year, then starts to pick up in August. Activity dips into October, before rising again in November and then reaching its peak in December. This shows that the climb toward year end is not a straight line but a mix of ups and downs.
Overall, the chart highlights that borrowing is not evenly distributed across the year, likely for specific financial needs. The number of loan applications and the total amount borrowed follow a clear pattern, indicating that borrowing habits may be influenced by seasonal trends.
When looking at both charts together, it is easy to see that the big jump in December 2016 on the line graph matches what happens most years since December is usually the busiest month for borrowing. The drop in October also matches with the monthly chart where October has the least activity. This shows that loans go up and down not only over time but also depending on the month of the year.
High-level or overall indicators¶
- According to the provided `clarity_underwriting_dictionary.xlsx` or `clarity_underwriting_dictionary.csv`
clearfraudscore¶
- Fraud score provided by clarity
- Higher score suggests lower default probability
boxplt_and_summary_stats(clean_df,
target_col = "target", feat_col = "clearfraudscore",
title = "clearfraudscore by `target`",
y_min = 50, y_max = 1000, step = 50)
- Summary Statistics:
| count | mean | std | min | 25% | 50% | 75% | max | range | IQR | |
|---|---|---|---|---|---|---|---|---|---|---|
| target | ||||||||||
| Safe | 16364 | 710.764 | 122.012 | 177.000 | 622.000 | 727.000 | 805.000 | 963.000 | 786.000 | 183.000 |
| Risky | 15855 | 659.416 | 127.210 | 122.000 | 565.500 | 661.000 | 760.000 | 965.000 | 843.000 | 194.500 |
The box and whisker plot shows that the median fraud score for safe loans is around 727, which is higher than the median score of 661 for risky loans. Similarly, the average fraud score for safe loans is 710.76, while for risky loans, it's 659.42. This suggests that, on average, safe loans tend to have higher fraud scores than risky loans.
Looking at the spread of scores, both groups have a similar range, with standard deviations of around 122 - 127. The middle 50% of fraud scores for safe loans fall between 622 and 805, while for risky loans, they range from 565.5 to 760. This means that fraud scores for risky loans are more spread out compared to those for safe loans.
There are also some unusual values in the data. Some loans have very low fraud scores, as shown by the small circles in the chart, which represent outliers. The lowest fraud score for safe loans is 177, while for risky loans, it's 122. On the higher end, the maximum fraud scores for both groups are nearly the same, around 963 - 965.
One expected observation is that safe loans have higher fraud scores than risky loans.
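boxplt_and_summary_stats is defined earlier in the notebook; a minimal sketch of how such a per-target summary table can be computed, using groupby describe plus derived range and IQR columns (toy scores below, not the real clearfraudscore distribution):

```python
import pandas as pd

# Toy loan-level data standing in for clean_df[["target", "clearfraudscore"]]
df = pd.DataFrame({"target": [0, 0, 0, 0, 1, 1, 1, 1],
                   "clearfraudscore": [622, 727, 805, 963, 565, 661, 760, 122]})

# describe() per target group, then add the derived columns shown above
stats = df.groupby("target")["clearfraudscore"].describe()
stats["range"] = stats["max"] - stats["min"]
stats["IQR"] = stats["75%"] - stats["25%"]
print(stats[["count", "50%", "range", "IQR"]])
```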
cfind.totalnumberoffraudindicators¶
- Fraud Indicator: Total Number of unique fraud indicators
boxplt_and_summary_stats(clean_df,
target_col = "target", feat_col = "cfind.totalnumberoffraudindicators",
title = "totalnumberoffraudindicators by `target`",
y_min = -1, y_max = 10, step = 0.5)
- Summary Statistics:
| count | mean | std | min | 25% | 50% | 75% | max | range | IQR | |
|---|---|---|---|---|---|---|---|---|---|---|
| target | ||||||||||
| Safe | 16392 | 2.056 | 1.222 | 0.000 | 1.000 | 2.000 | 3.000 | 8.000 | 8.000 | 2.000 |
| Risky | 15903 | 2.179 | 1.285 | 0.000 | 1.000 | 2.000 | 3.000 | 8.000 | 8.000 | 2.000 |
The box and whisker plot shows how the total number of fraud indicators is distributed for two types of loans, safe and risky. The way the data is spread out looks quite similar for both groups. The middle 50% of the data, known as the interquartile range (IQR), is the same for both types of loans, with a value of 2.0. The median, which represents the middle value of the dataset, is also the same for both categories, at $\approx 2.0$. This means that, on average, the number of fraud indicators doesn't differ much between safe and risky loans.
However, when looking at the average number of fraud indicators, the risky loans have a slightly higher value of 2.179 compared to 2.056 for safe loans. This suggests that, in general, risky loans tend to have a slightly greater number of fraud indicators, but the difference is tiny. The standard deviation, which measures how much the values vary from the average, is also quite similar for both groups. This indicates that the level of variation in fraud indicators doesn't differ significantly between the two loan categories.
The fraud indicator values range from 0 to 8 in both groups, meaning that some loans have no fraud indicators at all while others have as many as eight. There are a few cases with much higher fraud indicator counts, which appear as outliers in the plot. These represent unusual loans with a significantly larger number of fraud indicators compared to the rest. In summary, while risky loans tend to have a slightly higher number of fraud indicators on average, the overall distribution of fraud indicators is very similar between the two groups. There is no major difference in how fraud indicators are spread between safe and risky loans.
cfindvrfy.nameaddressmatch¶
- Provides a high level indication of whether the name appears to belong with the address on the current application
plot_stacked_bar(clean_df, "cfindvrfy.nameaddressmatch")
- Summary Statistics:
| target | Risky | Safe | ||
|---|---|---|---|---|
| Counts | Proportion (%) | Counts | Proportion (%) | |
| cfindvrfy.nameaddressmatch | ||||
| match | 5590 | 45.959 | 6573 | 54.041 |
| mismatch | 6053 | 51.913 | 5607 | 48.087 |
| unavailable | 2103 | 51.068 | 2015 | 48.932 |
| partial | 1786 | 49.283 | 1838 | 50.717 |
| invalid | 365 | 50.624 | 356 | 49.376 |
| NaN | 16 | 61.538 | 10 | 38.462 |
The stacked bar chart illustrates how loans are split into safe and risky loans, based on whether the name and address match.
Most of the loans fall under the match and mismatch categories. In the match group, there are more safe loans (54.04%) than risky loans (45.96%), which means that when the name and address match, the loan is usually safer. However, in the mismatch group, there are more risky loans (51.91%) than safe loans (48.09%), suggesting that when the name and address don't match, there is a higher chance of risk.
For the unavailable and partial groups, the percentage of safe and risky loans is almost equal. The partial group has slightly more safe loans (50.72%) than risky loans (49.28%). The invalid group is very balanced, but there are slightly more risky loans (50.62%). The NaN category, which means missing or unknown information, has more risky loans (61.54%) than safe loans (38.46%). This suggests that missing information could be linked to higher risk.
Overall, loans with a name and address match seem to be safer. If the information is missing or doesn't match, the loan is more likely to be risky.
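plot_stacked_bar is defined earlier in the notebook; a minimal sketch of the count and row-proportion table that underlies such a stacked bar, using pd.crosstab on toy data (the column name "cat" is illustrative):

```python
import pandas as pd

# Toy data standing in for a categorical feature crossed with target
df = pd.DataFrame({"cat": ["match", "match", "mismatch", "mismatch", "mismatch"],
                   "target": [0, 0, 0, 1, 1]})

# Counts per category/class, and row-wise proportions in percent
cnt = pd.crosstab(df["cat"], df["target"])
prop = pd.crosstab(df["cat"], df["target"], normalize = "index").mul(100).round(3)
print(cnt)
print(prop)
```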
cfindvrfy.overallmatchresult¶
- Provides a high level indication of whether key personal information from the current application appears to belong together
plot_stacked_bar(clean_df, "cfindvrfy.overallmatchresult", maxtickval = 24)
- Summary Statistics:
| target | Risky | Safe | ||
|---|---|---|---|---|
| Counts | Proportion (%) | Counts | Proportion (%) | |
| cfindvrfy.overallmatchresult | ||||
| partial | 11404 | 50.331 | 11254 | 49.669 |
| match | 4364 | 46.465 | 5028 | 53.535 |
| other | 94 | 54.335 | 79 | 45.665 |
| mismatch | 35 | 55.556 | 28 | 44.444 |
| NaN | 16 | 61.538 | 10 | 38.462 |
The chart shows the results of an overall match check for loans, comparing the number of safe and risky loans in different categories.
Most loans fall into the partial and match categories. In the partial category, the percentage of risky loans (50.33%) and safe loans (49.67%) is almost the same. This means that when the match is only partial, it doesn't clearly show if a loan is risky or safe. In the match category, there are more safe loans (53.53%) than risky ones (46.47%). This suggests that a full match is more common for safer loans.
For the other category, risky loans (54.34%) are slightly more than safe loans (45.66%). This shows that when the match is unclear, the loan is a little more likely to be risky. The mismatch category has even more risky loans (55.56%) than safe ones (44.44%), meaning that if names and addresses don't match, the loan is more often risky.
The NaN category, which means missing data, has the highest number of risky loans (61.54%), while safe loans are only 38.46%. This shows that when important information is missing, there is a higher chance that the loan is risky.
Overall, the chart shows that loans with matching information tend to be less risky, while those with mismatched or missing information are more likely to be risky.
cfindvrfy.ssnnamematch¶
- Provides a high level indication of whether the SSN appears to belong with the name on the current application
plot_stacked_bar(clean_df, "cfindvrfy.ssnnamematch", maxtickval = 30)
- Summary Statistics:
| target | Risky | Safe | ||
|---|---|---|---|---|
| Counts | Proportion (%) | Counts | Proportion (%) | |
| cfindvrfy.ssnnamematch | ||||
| match | 14212 | 49.217 | 14664 | 50.783 |
| partial | 1046 | 48.924 | 1092 | 51.076 |
| mismatch | 518 | 49.007 | 539 | 50.993 |
| unavailable | 117 | 56.522 | 90 | 43.478 |
| NaN | 16 | 61.538 | 10 | 38.462 |
| invalid | 4 | 50.000 | 4 | 50.000 |
The stacked bar chart shows how loans are classified as risky or safe based on the ssnnamematch category.
The match category appears the most frequently, meaning most records have a matching SSN and name. In this group, the percentage of safe loans (50.78%) is slightly higher than risky loans (49.22%), showing a more balanced distribution.
Other categories, such as partial, mismatch, unavailable, NaN and invalid, have fewer records. It's interesting to note that in the unavailable category, the percentage of risky loans (56.52%) is higher than safe loans (43.48%). This suggests that when the SSN-name information is missing or unavailable, loans are more likely to be seen as risky.
The mismatch category, where the SSN and name don't match, has nearly equal percentages of risky and safe loans, with 50.99% of loans classified as safe.
From this, we can see that having a correct SSN-name match is linked to lower risk, while missing or unavailable information might make a loan more likely to be classified as risky.
cfindvrfy.phonematchresult¶
- Provides a high level indication of whether the phone number appears to belong with the name and/or address on the current application
plot_stacked_bar(clean_df, "cfindvrfy.phonematchresult", maxtickval = 32)
- Summary Statistics:
| target | Risky | Safe | ||
|---|---|---|---|---|
| Counts | Proportion (%) | Counts | Proportion (%) | |
| cfindvrfy.phonematchresult | ||||
| unavailable | 15187 | 49.262 | 15642 | 50.738 |
| match | 343 | 48.107 | 370 | 51.893 |
| invalid | 201 | 45.270 | 243 | 54.730 |
| partial | 86 | 54.430 | 72 | 45.570 |
| mismatch | 76 | 56.716 | 58 | 43.284 |
| NaN | 20 | 58.824 | 14 | 41.176 |
The stacked bar plot shows how phone match results relate to loan classifications as either risky or safe. The most common category in the data is unavailable, meaning there was no phone match information. In this group, the proportion of risky loans (49.26%) and safe loans (50.74%) is almost equal. This suggests that not having phone match data doesn't strongly indicate whether a loan is risky or not.
For cases where the phone number doesn't match, the percentage of risky loans is higher, reaching 56.72%. Similarly, missing data (NaN) has an even greater proportion of risky loans at 58.82%. This suggests that when phone information is missing or incorrect, there is a greater chance that the loan is considered risky. On the other hand, when a phone match is found, the proportion of safe loans is slightly higher than risky loans. Even for invalid phone numbers, 54.73% of loans fall into the safe category, meaning that an incorrect phone number doesn't always mean a loan is more risky.
Overall, missing or mismatched phone numbers tend to be linked to a higher percentage of risky loans. However, having a valid phone match doesn't guarantee that a loan is safe, but it does seem to create a more balanced distribution between safe and risky loans.
cfindvrfy.overallmatchreasoncode¶
- 125 possible values provide details to support the overall match result, as stated in the provided `clarity_underwriting_dictionary.csv` and `clarity_underwriting_dictionary.xlsx`
display(Markdown(f'**cfindvrfy.overallmatchreasoncode has** **{clean_df["cfindvrfy.overallmatchreasoncode"].nunique(dropna = False)} unique values,**'
f' **including the missing values.**'))
# Calculate counts
cnt_pivot = clean_df.assign(cfindvrfy_overallmatchreasoncode = clean_df["cfindvrfy.overallmatchreasoncode"]
.cat.add_categories("Missing").fillna("Missing")) \
.pivot_table(index = "cfindvrfy_overallmatchreasoncode", columns = "target", aggfunc = "size", fill_value = 0, observed = False)
# Calculate proportions
prop_pivot = cnt_pivot.div(cnt_pivot.sum(axis = 1), axis = 0)
# Reset the index to work with sorting
prop_pivot = prop_pivot.reset_index()
# Sort by descending order of proportion in column "1" and ascending order of reason codes
prop_pivot = prop_pivot.sort_values(by = [1, "cfindvrfy_overallmatchreasoncode"],
ascending = [False, True])
# Set the index back to 'cfindvrfy.overallmatchreasoncode'
prop_pivot = prop_pivot.set_index("cfindvrfy_overallmatchreasoncode")
# Reorder the counts DataFrame to match the sorted proportions
cnt_pivot = cnt_pivot.loc[prop_pivot.index]
# Combine counts and proportion into a single DataFrame for annotation
annot = cnt_pivot.astype(str) + " (" + (prop_pivot * 100).round(1).astype(str) + "%)"
plt.figure(figsize = (10, 20))
ax = sns.heatmap(prop_pivot, annot = annot, fmt = "", cmap = "RdBu", cbar_kws = {"label": "Proportion"}, annot_kws = {"size": 8})
plt.title("Heatmap of cfindvrfy.overallmatchreasoncode by target", fontsize = 9)
plt.ylabel("cfindvrfy.overallmatchreasoncode", fontsize = 9)
plt.yticks(fontsize = 8)
plt.xlabel("Loans", fontsize = 9)
# Adjust x-tick labels to ensure they are centered
ax.set_xticks([0.5, 1.5]) # Set the tick positions in the middle of the columns
ax.set_xticklabels(["Safe", "Risky"]) # Set the labels
plt.show()
del cnt_pivot, prop_pivot, annot, ax;
cfindvrfy.overallmatchreasoncode has 74 unique values, including the missing values.
The heatmap shows how different cfindvrfy.overallmatchreasoncode relate to safe and risky loans. The values in each box show the count and percentage for each cfindvrfy.overallmatchreasoncode.
Despite the low counts for these codes, certain cfindvrfy.overallmatchreasoncode values, like 43, 59 and 64, appear only in risky loans, while codes like 24, 33, 34, 35, 47, 69, 73 and 74 are found only in safe loans. This suggests that these codes are probably associated with riskier and safer loans respectively. Additionally, codes 27, 39, 54, 62 and 125 have equal proportions in both groups.
The deeper the blue color, the higher the proportion of a given cfindvrfy.overallmatchreasoncode in either safe or risky loans.
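The class-exclusive codes noted above can be identified mechanically. A minimal sketch against a toy cnt_pivot (the real one is indexed by reason code with one column per target class; codes here are illustrative):

```python
import pandas as pd

# Toy counts per reason code: column 0 = Safe, column 1 = Risky
cnt_pivot = pd.DataFrame({0: [5, 0, 3], 1: [0, 4, 3]},
                         index = ["24", "43", "27"])

# Codes that appear only in risky loans, and only in safe loans
risky_only = cnt_pivot.index[(cnt_pivot[1] > 0) & (cnt_pivot[0] == 0)].tolist()
safe_only = cnt_pivot.index[(cnt_pivot[0] > 0) & (cnt_pivot[1] == 0)].tolist()
print(risky_only, safe_only)  # ['43'] ['24']
```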
cfindvrfy.ssndobmatch¶
- Provides a high level indication of whether the Social Security Number appears to belong with the date of birth on the current application
plot_stacked_bar(clean_df, "cfindvrfy.ssndobmatch", maxtickval = 27)
- Summary Statistics:
| target | Risky | Safe | ||
|---|---|---|---|---|
| Counts | Proportion (%) | Counts | Proportion (%) | |
| cfindvrfy.ssndobmatch | ||||
| match | 12760 | 49.385 | 13078 | 50.615 |
| partial | 2250 | 50.167 | 2235 | 49.833 |
| invalid | 702 | 44.628 | 871 | 55.372 |
| mismatch | 151 | 46.605 | 173 | 53.395 |
| unavailable | 34 | 51.515 | 32 | 48.485 |
| NaN | 16 | 61.538 | 10 | 38.462 |
The chart shows how different categories of cfindvrfy.ssndobmatch are divided between safe and risky loans. The match category is the most common and appears almost equally in both safe and risky loans, with 50.62% in safe loans and 49.38% in risky loans.
The partial category also has a similar proportion of risky and safe loans. For the invalid category, there are more safe loans than risky ones. The mismatch category shows more safe loans than risky loans.
The unavailable category has slightly more risky loans compared to safe loans. Finally, the proportion of risky loans is highest in the absence of a matching code, compared to all other matching code categories.
In summary, loans with matching information are generally more balanced in risk. Loans with unavailable or NaN information though rare are more likely to be risky.
State against apr and loanAmount by target¶
- According to the correlation ratio:
  - `state` and `apr`: 0.8737
  - `state` and `loanAmount`: 0.6038
- According to the Spearman's R:
  - `apr` and `loanAmount`: -0.2127
sub_df = clean_df[["target", "state", "apr", "loanAmount"]].copy()
sub_df["target_label"] = sub_df["target"].map({0: "Safe", 1: "Risky"})
# Compute count data by state and target
cnt_df = sub_df.groupby(["state", "target_label"], observed = False).size().reset_index(name = "Count")
# Compute state-wise proportions and counts
prop_df = (cnt_df.pivot(index = "state", columns = "target_label", values = "Count")
.assign(Total = lambda df: df.sum(axis = 1))
.assign(**{"Risky Proportion": lambda df: df["Risky"] / df["Total"],
"Safe Proportion": lambda df: df["Safe"] / df["Total"]
})
.sort_values("Risky Proportion", ascending = False)
)
sorted_states = prop_df.index # Sorted order of states
# Convert 'state' to categorical with sorted order
sub_df["state"] = pd.Categorical(sub_df["state"], categories = sorted_states, ordered = True)
# Sort sub_df to ensure correct order in FacetGrid
sub_df = sub_df.sort_values("state")
# Create state-specific proportion and count labels
prop_texts = {state: (f'Risky: {prop_df.loc[state, "Risky"]: .0f} ({prop_df.loc[state, "Risky Proportion"]: .1%})\n'
f'Safe: {prop_df.loc[state, "Safe"]: .0f} ({prop_df.loc[state, "Safe Proportion"]:.1%})')
for state in sorted_states if state in prop_df.index
}
# FacetGrid plot (ensuring correct state order)
g = sns.FacetGrid(sub_df, col = "state", hue = "target_label", col_wrap = 3, height = 3,
palette = {"Safe": "green", "Risky": "red"},
col_order = sorted_states) # Ensure order is applied
g.map(sns.scatterplot, "apr", "loanAmount", s = 50, alpha = 0.3)
g.set_axis_labels("Loan APR (%)", "Loan Amount (USD)")
g.add_legend(title = "Loans", label_order = ["Safe", "Risky"])
# Annotate each subplot with correct state-specific proportions & counts
for ax, state in zip(g.axes.flat, sorted_states):
if state in prop_texts:
ax.text(0.4, 0.95, prop_texts[state],
transform = ax.transAxes, fontsize = 7, verticalalignment = "top",
bbox = dict(facecolor = "white", alpha = 0.2, edgecolor = "black"))
plt.show();
del prop_df, prop_texts, g;
The FacetGrid plots illustrate the relationship between loan amounts and interest rates across different states, arranged in descending order based on the proportion of risky loans. Each plot represents a state, with green dots indicating safe loans and red dots representing risky loans.
North Dakota (ND) has the highest proportion of risky loans at 66.7%, though it also has the fewest loans compared to other states. It's followed by Oklahoma (OK), Idaho (ID) and so on.
Colorado (CO) stands out for having the highest proportion of safe loans at 70.1%, mostly with lower interest rates. It's followed by Georgia (GA), Illinois (IL) and others.
New Jersey (NJ) and Rhode Island (RI) have an equal proportion of both safe and risky loans.
In most states, loans are concentrated at the higher end of APR and the lower end of loan amounts, indicating that smaller loans often come with higher interest rates. However, CO doesn't follow this trend, as loans there tend to have both lower amounts and lower APRs.
Interestingly, California (CA) displays two distinct loan clusters. The first cluster consists of loans with APRs between 100% and 300% and loan amounts mostly ranging from USD1,000 to USD4,000, with many loans around USD3,000. This group has a mix of both safe and risky loans. The second cluster is at the extreme high end of APR, between 500% and 600%, where loan amounts are typically USD1,000 to USD3,000, with many around USD2,000. Most of these loans are risky, with only a few classified as safe.
Apart from CA, Georgia (GA) is another state where loans are mostly concentrated in the lower APR and higher loan amount range.
Both safe and risky loans show up in every state and across all types. This means that whether a loan becomes risky is not just about the loan size or the interest rate. It also depends on other things like past payment patterns, how much debt is linked to the loan compared to income and what is happening in the economy at the time.
Overall, the plots indicate an inverse relationship between APR and loan amount where high-interest loans are more common in many states and tend to carry higher risk. In contrast, loans with lower interest rates show a mix of both safe and risky loans.
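The inverse APR/loan-amount relationship is consistent with the negative Spearman's R cited at the top of this section. A minimal sketch of that check on four illustrative (apr, loanAmount) pairs (toy values, not the full clean_df):

```python
import pandas as pd

# Toy (apr, loanAmount) pairs standing in for clean_df[["apr", "loanAmount"]]
df = pd.DataFrame({"apr": [645.0, 590.0, 180.2, 95.0],
                   "loanAmount": [400.0, 500.0, 500.0, 3100.0]})

# Rank-based Spearman correlation captures the monotone inverse relationship
rho = df["apr"].corr(df["loanAmount"], method = "spearman")
print(round(rho, 4))
```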
# Print unlimited number of rows by setting to None, default is 10
pd.set_option('display.max_rows', None)
# Calculate summary statistics by groups
sub_df.groupby(["state", "target_label"], observed = False).agg({"apr": ["count", "mean", "min", "median", "max", "std"],
"loanAmount": ["count", "mean", "min", "median", "max", "std"]})
# Reset to default setting
pd.reset_option("display.max_rows")
| apr | loanAmount | ||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | mean | min | median | max | std | count | mean | min | median | max | std | ||
| state | target_label | ||||||||||||
| AK | Risky | 15 | 630.333333 | 590.000 | 645.00 | 645.0 | 25.175574 | 15 | 640.000000 | 300.0 | 500.0 | 1500.0 | 393.791098 |
| Safe | 12 | 618.333333 | 490.000 | 645.00 | 645.0 | 47.258156 | 12 | 470.833333 | 300.0 | 400.0 | 900.0 | 168.493773 | |
| AL | Risky | 132 | 637.500000 | 590.000 | 645.00 | 645.0 | 18.946490 | 132 | 478.219697 | 300.0 | 400.0 | 1500.0 | 202.922477 |
| Safe | 104 | 633.894231 | 590.000 | 645.00 | 645.0 | 22.185851 | 104 | 510.096154 | 300.0 | 400.0 | 1500.0 | 262.769031 | |
| AZ | Risky | 260 | 633.304231 | 404.100 | 645.00 | 645.0 | 26.487933 | 260 | 592.834615 | 300.0 | 500.0 | 1500.0 | 294.090574 |
| Safe | 273 | 635.439560 | 565.000 | 645.00 | 645.0 | 21.056406 | 273 | 586.996337 | 300.0 | 500.0 | 1850.0 | 317.453893 | |
| CA | Risky | 813 | 426.786255 | 139.125 | 590.00 | 645.0 | 189.864231 | 813 | 1554.151292 | 300.0 | 800.0 | 3750.0 | 1185.799789 |
| Safe | 639 | 361.378247 | 135.150 | 242.00 | 645.0 | 194.951067 | 639 | 1928.979656 | 300.0 | 2600.0 | 4687.0 | 1190.613027 | |
| CO | Risky | 134 | 180.200000 | 180.200 | 180.20 | 180.2 | 0.000000 | 134 | 497.014925 | 400.0 | 500.0 | 500.0 | 14.716869 |
| Safe | 314 | 180.200000 | 180.200 | 180.20 | 180.2 | 0.000000 | 314 | 494.267516 | 400.0 | 500.0 | 500.0 | 19.958546 | |
| CT | Risky | 91 | 629.890110 | 590.000 | 645.00 | 645.0 | 24.686681 | 91 | 563.186813 | 300.0 | 500.0 | 2000.0 | 278.022298 |
| Safe | 100 | 621.850000 | 540.000 | 645.00 | 645.0 | 30.281966 | 100 | 609.750000 | 300.0 | 600.0 | 1500.0 | 253.532991 | |
| DE | Risky | 37 | 634.594595 | 590.000 | 645.00 | 645.0 | 21.838370 | 37 | 512.162162 | 300.0 | 400.0 | 1250.0 | 243.856574 |
| Safe | 32 | 634.687500 | 590.000 | 645.00 | 645.0 | 21.810677 | 32 | 520.312500 | 300.0 | 400.0 | 2000.0 | 320.183762 | |
| FL | Risky | 926 | 621.841253 | 565.000 | 645.00 | 645.0 | 27.249313 | 926 | 510.231102 | 300.0 | 400.0 | 2000.0 | 248.268618 |
| Safe | 722 | 625.380886 | 340.000 | 645.00 | 645.0 | 28.299475 | 722 | 525.242382 | 300.0 | 400.0 | 2000.0 | 265.316085 | |
| GA | Risky | 51 | 205.607843 | 95.000 | 217.00 | 251.0 | 47.815511 | 51 | 3318.627451 | 3100.0 | 3100.0 | 4000.0 | 259.029107 |
| Safe | 86 | 187.058140 | 95.000 | 182.00 | 251.0 | 47.657196 | 86 | 3345.639535 | 3100.0 | 3100.0 | 4375.0 | 337.969540 | |
| HI | Risky | 21 | 631.904762 | 590.000 | 645.00 | 645.0 | 24.003968 | 21 | 657.142857 | 300.0 | 500.0 | 1500.0 | 392.519335 |
| Safe | 23 | 633.043478 | 590.000 | 645.00 | 645.0 | 23.195764 | 23 | 659.782609 | 300.0 | 500.0 | 2000.0 | 489.244497 | |
| IA | Risky | 58 | 632.672414 | 590.000 | 645.00 | 645.0 | 23.136049 | 58 | 564.224138 | 300.0 | 500.0 | 1250.0 | 255.771625 |
| Safe | 44 | 633.295455 | 515.000 | 645.00 | 645.0 | 27.299868 | 44 | 644.318182 | 300.0 | 550.0 | 2000.0 | 370.451627 | |
| ID | Risky | 30 | 628.500000 | 590.000 | 645.00 | 645.0 | 25.635038 | 30 | 488.333333 | 300.0 | 500.0 | 1000.0 | 147.205775 |
| Safe | 16 | 603.124375 | 29.990 | 645.00 | 645.0 | 153.450364 | 16 | 529.375000 | 300.0 | 512.5 | 845.0 | 182.728168 | |
| IL | Risky | 1810 | 356.939227 | 288.000 | 360.00 | 590.0 | 16.092486 | 1810 | 561.783978 | 200.0 | 500.0 | 1875.0 | 282.595354 |
| Safe | 2767 | 355.539212 | 288.000 | 360.00 | 590.0 | 17.219537 | 2767 | 609.594507 | 200.0 | 500.0 | 1875.0 | 319.969919 | |
| IN | Risky | 718 | 598.165599 | 360.000 | 590.00 | 681.0 | 32.641126 | 718 | 570.029944 | 200.0 | 500.0 | 2000.0 | 291.392307 |
| Safe | 784 | 597.074298 | 472.000 | 590.00 | 681.0 | 33.770968 | 784 | 577.349490 | 200.0 | 500.0 | 2000.0 | 298.366314 | |
| KS | Risky | 67 | 633.507463 | 590.000 | 645.00 | 645.0 | 22.529693 | 67 | 542.537313 | 300.0 | 500.0 | 1500.0 | 276.648466 |
| Safe | 48 | 640.416667 | 590.000 | 645.00 | 645.0 | 15.362061 | 48 | 527.604167 | 300.0 | 400.0 | 1500.0 | 250.384788 | |
| KY | Risky | 142 | 627.183099 | 590.000 | 645.00 | 645.0 | 25.829946 | 142 | 551.408451 | 300.0 | 400.0 | 1800.0 | 322.209441 |
| Safe | 122 | 633.278689 | 590.000 | 645.00 | 645.0 | 22.615822 | 122 | 532.991803 | 300.0 | 400.0 | 1500.0 | 249.778133 | |
| LA | Risky | 121 | 633.884298 | 565.000 | 645.00 | 645.0 | 22.550754 | 121 | 543.388430 | 300.0 | 500.0 | 1500.0 | 258.194222 |
| Safe | 79 | 633.797468 | 390.000 | 645.00 | 645.0 | 34.298979 | 79 | 532.594937 | 300.0 | 500.0 | 1250.0 | 224.727557 | |
| MI | Risky | 751 | 598.736152 | 472.000 | 590.00 | 681.0 | 33.680476 | 751 | 538.806924 | 200.0 | 437.0 | 2343.0 | 276.913491 |
| Safe | 762 | 594.978346 | 472.000 | 590.00 | 681.0 | 33.718559 | 762 | 568.286089 | 200.0 | 500.0 | 2000.0 | 297.293489 | |
| MN | Risky | 119 | 632.521008 | 590.000 | 645.00 | 645.0 | 23.132576 | 119 | 596.848739 | 300.0 | 500.0 | 1500.0 | 268.954241 |
| Safe | 156 | 629.807692 | 390.000 | 645.00 | 645.0 | 30.831931 | 156 | 620.993590 | 300.0 | 500.0 | 1500.0 | 325.830165 | |
| MO | Risky | 954 | 514.539308 | 300.000 | 490.00 | 590.0 | 61.036656 | 954 | 574.474843 | 200.0 | 500.0 | 3000.0 | 353.885938 |
| Safe | 841 | 497.050981 | 300.000 | 490.00 | 590.0 | 63.924922 | 841 | 614.272295 | 200.0 | 500.0 | 3000.0 | 339.666192 | |
| MS | Risky | 102 | 636.372549 | 590.000 | 645.00 | 645.0 | 20.100698 | 102 | 488.970588 | 300.0 | 400.0 | 1500.0 | 195.195178 |
| Safe | 77 | 636.428571 | 590.000 | 645.00 | 645.0 | 20.079728 | 77 | 474.350649 | 300.0 | 400.0 | 1250.0 | 198.637382 | |
| NC | Risky | 624 | 593.486458 | 510.000 | 601.00 | 601.0 | 22.722564 | 624 | 670.453526 | 600.0 | 600.0 | 1500.0 | 116.342169 |
| Safe | 594 | 585.236263 | 449.990 | 601.00 | 601.0 | 31.078067 | 594 | 695.644781 | 300.0 | 600.0 | 1562.0 | 165.654174 | |
| ND | Risky | 12 | 645.000000 | 645.000 | 645.00 | 645.0 | 0.000000 | 12 | 635.416667 | 375.0 | 500.0 | 1250.0 | 301.220686 |
| Safe | 6 | 645.000000 | 645.000 | 645.00 | 645.0 | 0.000000 | 6 | 866.666667 | 300.0 | 750.0 | 1500.0 | 546.504041 | |
| NE | Risky | 29 | 629.827586 | 590.000 | 645.00 | 645.0 | 25.017235 | 29 | 613.793103 | 300.0 | 500.0 | 1500.0 | 326.752569 |
| Safe | 31 | 626.129032 | 490.000 | 645.00 | 645.0 | 41.687522 | 31 | 604.838710 | 300.0 | 500.0 | 1800.0 | 308.046219 | |
| NJ | Risky | 500 | 635.690000 | 565.000 | 645.00 | 645.0 | 20.773468 | 500 | 627.900000 | 300.0 | 500.0 | 2000.0 | 345.663212 |
| Safe | 500 | 634.150000 | 465.000 | 645.00 | 645.0 | 23.726639 | 500 | 674.200000 | 300.0 | 500.0 | 2000.0 | 382.514044 | |
| NM | Risky | 96 | 636.979167 | 590.000 | 645.00 | 645.0 | 19.513547 | 96 | 488.541667 | 300.0 | 400.0 | 1500.0 | 243.600100 |
| Safe | 111 | 641.531532 | 590.000 | 645.00 | 645.0 | 13.429831 | 111 | 544.144144 | 300.0 | 400.0 | 1500.0 | 293.489100 | |
| NV | Risky | 300 | 579.486067 | 449.990 | 590.00 | 645.0 | 62.605006 | 300 | 543.060000 | 200.0 | 500.0 | 1875.0 | 269.930816 |
| Safe | 206 | 574.524563 | 449.990 | 590.00 | 645.0 | 71.576892 | 206 | 646.660194 | 300.0 | 500.0 | 2000.0 | 369.879323 | |
| OH | Risky | 2638 | 590.194244 | 300.000 | 590.00 | 681.0 | 46.675299 | 2638 | 573.169447 | 200.0 | 500.0 | 2000.0 | 324.991681 |
| Safe | 2379 | 584.061927 | 300.000 | 590.00 | 681.0 | 47.357594 | 2379 | 604.376629 | 200.0 | 500.0 | 2343.0 | 340.651392 | |
| OK | Risky | 78 | 633.717949 | 590.000 | 645.00 | 645.0 | 22.352488 | 78 | 532.051282 | 300.0 | 450.0 | 1500.0 | 267.713304 |
| Safe | 40 | 640.875000 | 590.000 | 645.00 | 645.0 | 14.671073 | 40 | 582.500000 | 300.0 | 500.0 | 1500.0 | 282.230130 | |
| PA | Risky | 486 | 635.977366 | 440.000 | 645.00 | 645.0 | 21.888224 | 486 | 648.146091 | 300.0 | 500.0 | 2000.0 | 331.099015 |
| Safe | 590 | 632.949153 | 265.000 | 645.00 | 645.0 | 35.225889 | 590 | 606.100000 | 300.0 | 500.0 | 2000.0 | 325.988759 | |
| RI | Risky | 22 | 617.500000 | 590.000 | 617.50 | 645.0 | 28.147147 | 22 | 607.954545 | 300.0 | 500.0 | 1500.0 | 314.281717 |
| Safe | 22 | 635.000000 | 590.000 | 645.00 | 645.0 | 21.712406 | 22 | 476.136364 | 300.0 | 400.0 | 1000.0 | 196.468414 | |
| SC | Risky | 369 | 585.830623 | 44.000 | 601.00 | 601.0 | 54.351125 | 369 | 694.262873 | 300.0 | 700.0 | 1500.0 | 155.785069 |
| Safe | 278 | 580.339388 | 290.000 | 600.50 | 601.0 | 54.572270 | 278 | 700.892086 | 100.0 | 601.5 | 1500.0 | 176.974501 | |
| SD | Risky | 44 | 581.818182 | 525.000 | 590.00 | 645.0 | 44.839389 | 44 | 619.318182 | 300.0 | 500.0 | 1800.0 | 335.230901 |
| Safe | 39 | 533.365128 | 29.990 | 590.00 | 645.0 | 126.650412 | 39 | 682.051282 | 300.0 | 600.0 | 2000.0 | 384.483108 | |
| TN | Risky | 725 | 601.644483 | 501.500 | 590.00 | 681.0 | 30.453131 | 725 | 490.685517 | 200.0 | 400.0 | 2000.0 | 210.059641 |
| Safe | 458 | 594.222707 | 501.500 | 590.00 | 681.0 | 32.452212 | 458 | 534.423581 | 200.0 | 450.0 | 2343.0 | 296.655050 | |
| TX | Risky | 1107 | 626.295393 | 290.000 | 680.00 | 680.0 | 102.661879 | 1107 | 531.397471 | 200.0 | 400.0 | 2000.0 | 304.197603 |
| Safe | 1096 | 626.721706 | 290.000 | 680.00 | 681.0 | 111.145845 | 1096 | 597.398723 | 100.0 | 500.0 | 2000.0 | 345.850714 | |
| UT | Risky | 98 | 596.886633 | 325.000 | 645.00 | 645.0 | 68.609765 | 98 | 594.387755 | 300.0 | 500.0 | 1800.0 | 343.881172 |
| Safe | 91 | 609.065495 | 325.000 | 645.00 | 645.0 | 60.695811 | 91 | 728.021978 | 300.0 | 600.0 | 2000.0 | 398.344406 | |
| VA | Risky | 535 | 359.000000 | 359.000 | 359.00 | 359.0 | 0.000000 | 535 | 913.551402 | 400.0 | 700.0 | 1800.0 | 479.327815 |
| Safe | 764 | 359.289267 | 359.000 | 359.00 | 580.0 | 7.995499 | 764 | 819.010471 | 400.0 | 600.0 | 1800.0 | 406.083519 | |
| WA | Risky | 90 | 629.722222 | 590.000 | 645.00 | 645.0 | 24.772687 | 90 | 619.444444 | 300.0 | 500.0 | 1500.0 | 321.704942 |
| Safe | 95 | 621.736737 | 29.990 | 645.00 | 645.0 | 66.408652 | 95 | 640.684211 | 300.0 | 500.0 | 1550.0 | 331.232876 | |
| WI | Risky | 771 | 501.084827 | 300.000 | 449.99 | 681.0 | 73.390785 | 771 | 557.443580 | 200.0 | 437.0 | 2285.0 | 309.859407 |
| Safe | 1069 | 486.390219 | 300.000 | 449.99 | 590.0 | 73.281926 | 1069 | 568.929841 | 200.0 | 500.0 | 3000.0 | 322.339900 | |
| WY | Risky | 37 | 589.324324 | 525.000 | 590.00 | 645.0 | 41.166366 | 37 | 770.270270 | 300.0 | 700.0 | 2000.0 | 448.989523 |
| Safe | 29 | 587.974138 | 490.000 | 590.00 | 645.0 | 46.539171 | 29 | 781.034483 | 300.0 | 500.0 | 2000.0 | 506.636985 | |
leadCost and leadType by target¶
- According to the correlation ratio, the association between `leadCost` and `leadType` is 0.7084.
- The lead type determines the underwriting rules for a lead:
  - `bvMandatory`: leads bought from the ping tree; bank verification is required before loan approval
  - `lead`: very similar to `bvMandatory`, except bank verification is optional for loan approval
  - `california`: similar to `lead`, but optimized for California lending rules
  - `organic`: customers who came through the MoneyLion website
  - `rc_returning`: customers who have at least one paid-off loan in another loan portfolio (the first paid-off loan isn't in this data set)
  - `prescreen`: preselected customers who have been offered a loan through direct mail campaigns
  - `express`: promotional "express" loans
  - `repeat`: promotional loans offered through SMS
  - `instant-offer`: promotional "instant-offer" loans
  - `lionpay`
fig = px.box(clean_df, x = "leadCost", y = "leadType", color = "target",
labels = {"target": "Loans", "leadCost": "Lead Cost (USD)", "leadType": "Lead Type"},
title = "Box Plot of Lead Cost by Lead Type and Target",
category_orders = {"target": [0, 1]}, # Plot: "Red" bar in the top, "green" bar in the bottom
color_discrete_map = {0: "green", 1: "red"},
             points = "suspectedoutliers", # Show only suspected outliers
boxmode = "group")
fig.update_layout(title = {"x": 0.5, "font": {"size": 18, "weight": "bold"}},
width = 1200, height = 600, boxmode = "group", # Ensures grouped boxes don't overlap
boxgap = 0.4, # Adjusts spacing between each box
boxgroupgap = 0.5, # Adjusts spacing between groups of boxes
legend_traceorder = "reversed", # Shows "red" on top and "green" on the bottom in the legend
legend = dict(x = 0.85,
y = 0.95,
bgcolor = "rgba(255, 255, 255, 0.4)" # Adds a semi-transparent background
)
)
fig.update_traces(boxmean = True, opacity = 0.5) # Show the mean line inside each box and make boxes semi-transparent
fig.for_each_trace(lambda t: t.update(name = "Safe" if t.name == "0" else "Risky"))
fig.show();
plot_stacked_bar(clean_df, "leadType", maxtickval = 30)
# Summary statistics
summary_df = clean_df.groupby(["leadType", "target"], observed = False)["leadCost"].describe().fillna(0).reset_index()
# Convert target values (0 -> "Safe", 1 -> "Risky")
summary_df["target"] = summary_df["target"].map({0: "Safe", 1: "Risky"})
display(summary_df)
del fig;
- Summary Statistics:
| target | Risky | Safe | ||
|---|---|---|---|---|
| Counts | Proportion (%) | Counts | Proportion (%) | |
| leadType | ||||
| bvMandatory | 8390 | 57.368 | 6235 | 42.632 |
| lead | 5019 | 44.689 | 6212 | 55.311 |
| organic | 1856 | 37.495 | 3094 | 62.505 |
| prescreen | 595 | 45.489 | 713 | 54.511 |
| rc_returning | 37 | 27.007 | 100 | 72.993 |
| california | 16 | 32.653 | 33 | 67.347 |
| instant-offer | 0 | 0.000 | 8 | 100.000 |
| lionpay | 0 | 0.000 | 2 | 100.000 |
| express | 0 | 0.000 | 1 | 100.000 |
| repeat | 0 | 0.000 | 1 | 100.000 |
| leadType | target | count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | bvMandatory | Safe | 6235.0 | 4.775461 | 2.406535 | 3.0 | 3.0 | 3.0 | 6.0 | 11.0 |
| 1 | bvMandatory | Risky | 8390.0 | 4.749702 | 2.392083 | 3.0 | 3.0 | 3.0 | 6.0 | 11.0 |
| 2 | california | Safe | 33.0 | 165.151515 | 28.735998 | 120.0 | 170.0 | 170.0 | 170.0 | 200.0 |
| 3 | california | Risky | 16.0 | 140.625000 | 47.953971 | 10.0 | 120.0 | 120.0 | 170.0 | 200.0 |
| 4 | express | Safe | 1.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 5 | instant-offer | Safe | 8.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 6 | lead | Safe | 6212.0 | 33.703799 | 25.664302 | 0.0 | 25.0 | 25.0 | 40.0 | 200.0 |
| 7 | lead | Risky | 5019.0 | 31.641761 | 24.839556 | 0.0 | 25.0 | 25.0 | 40.0 | 200.0 |
| 8 | lionpay | Safe | 2.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 9 | organic | Safe | 3094.0 | 0.136070 | 3.042097 | 0.0 | 0.0 | 0.0 | 0.0 | 115.0 |
| 10 | organic | Risky | 1856.0 | 0.221983 | 2.628123 | 0.0 | 0.0 | 0.0 | 0.0 | 75.0 |
| 11 | prescreen | Safe | 713.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 12 | prescreen | Risky | 595.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 13 | rc_returning | Safe | 100.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 14 | rc_returning | Risky | 37.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| 15 | repeat | Safe | 1.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
The two charts provide information on how different lead types perform based on their cost and risk. The first chart, a box-whisker plot, shows the distribution of lead costs and whether they are linked to safe or risky loans. The second chart, a stacked bar chart, represents the proportion of risky and safe loans for each lead type.
The box-whisker plot indicates that some lead types, including lionpay, express, instant-offer, rc_returning, repeat and prescreen, have a lead cost of zero. Among these, only rc_returning and prescreen contain risky loans, with 27% and 45.5% of loans categorized as risky, respectively. The other four lead types consist entirely of safe loans, though their total loan counts are very small.
The california lead type has the highest average cost, followed by lead and bvMandatory, regardless of whether the loans are safe or risky. The difference in lead cost between safe and risky loans is most noticeable in california compared to other lead types. About one-third of the loans in this category are risky. The mean lead cost for safe loans in california is USD165.15, while for risky loans, it's lower at USD140. However, the median lead cost for safe loans is only USD5 higher than the mean, while for risky loans, the median is USD20 lower than the mean. This difference suggests that the distribution of lead costs for safe loans is slightly skewed to the left, whereas for risky loans, it's skewed to the right.
The box-whisker plot shows that bvMandatory and lead have similar variability in lead cost, but lead has more outliers, regardless of loan risk. These two lead types account for the highest numbers of loans, with bvMandatory at 14,625 and lead at 11,231. However, they also carry high proportions of risky loans, at 57.4% and 44.7%, respectively. While these leads generate a significant number of loans, they also present a higher risk of financial loss.
The stacked bar chart shows that the organic lead type has a lower proportion of risky loans at 37.5%. Its distribution in the box-whisker plot suggests that its cost is relatively stable, which may indicate a safer investment for the lender compared to bvMandatory and lead.
The organic lead type appears to be a better option, as 62.5% of its loans are safe and it has one of the lowest and most stable costs. This makes it a less risky and more predictable choice compared to other lead types.
One way to handle this is to shift attention away from lead types that frequently produce risky loans and invest more effort and money in those with consistently better outcomes. A higher lead cost does not imply a safer loan, which is why costs need to be monitored carefully. If certain lead types keep converting into too many risky loans, it may be better to scale them back or drop them entirely so the business does not lose money.
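The risky share per lead type discussed above reduces to a simple groupby: since target is 0/1, its mean within each group is the risky rate. A minimal sketch on hypothetical toy data (the `df` below stands in for the relevant columns of `clean_df`):

```python
import pandas as pd

# Hypothetical toy data standing in for clean_df[["leadType", "target"]]
df = pd.DataFrame({
    "leadType": ["bvMandatory"] * 4 + ["organic"] * 4,
    "target":   [1, 1, 1, 0,      0, 0, 0, 1],  # 1 = Risky, 0 = Safe
})

# Loan counts and risky share per lead type; mean of a 0/1 target = risky rate
summary = (df.groupby("leadType")["target"]
             .agg(n="count", risky_pct="mean")
             .assign(risky_pct=lambda d: (d["risky_pct"] * 100).round(1)))
print(summary)
```

On this toy frame, bvMandatory comes out at 75.0% risky and organic at 25.0%, mirroring the kind of comparison the stacked bar chart makes.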
First Payment Status¶
plot_stacked_bar(clean_df, "fpStatus")
- Summary Statistics:
| target | Risky | Safe | ||
|---|---|---|---|---|
| Counts | Proportion (%) | Counts | Proportion (%) | |
| fpStatus | ||||
| Checked | 11263 | 41.639 | 15786 | 58.361 |
| Rejected | 4524 | 93.723 | 303 | 6.277 |
| Cancelled | 34 | 19.883 | 137 | 80.117 |
| NaN | 5 | 3.546 | 136 | 96.454 |
| Skipped | 87 | 71.901 | 34 | 28.099 |
| Pending | 0 | 0.000 | 3 | 100.000 |
The chart and table above give a simple view of how safe and risky loans appear at the stage of the first payment. When the first payment is Checked, most of the loans fall into the safe group, while a smaller but still noticeable share are risky. This means that a Checked first payment usually points toward safety, though it is not a guarantee.
When the first payment is Rejected, the pattern is much clearer. Almost all of these loans are risky, and only a very small number are safe. A rejected first payment strongly connects with risk.
Cancelled loans show something different. Most of the cancelled loans are safe, with only a small number being risky. This suggests that cancellation often happens for reasons unrelated to risk, and it does not necessarily reflect poor quality.
Skipped payments lean heavily toward risky loans. Most of the skipped loans are risky, while fewer are safe. This makes skipped payments appear more closely tied to risk. Pending loans are all safe, but the number is very small. The few loans with missing information (NaN) are also mostly safe but too limited to affect the overall results.
Overall, the picture is clear. Checked loans are mostly safe but include some risk. Rejected and Skipped loans are strongly linked to risk. Cancelled loans are mostly safe and do not show the same level of concern. These differences highlight how the first payment status gives an early sense of whether a loan is safe or risky.
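The row percentages in the table above can be reproduced with a row-normalised crosstab. A minimal sketch on hypothetical toy data (standing in for `clean_df`'s fpStatus and target columns):

```python
import pandas as pd

# Toy stand-in for clean_df[["fpStatus", "target"]]
df = pd.DataFrame({
    "fpStatus": ["Checked", "Checked", "Rejected", "Rejected", "Rejected"],
    "target":   [0, 1, 1, 1, 0],
})

# normalize="index" makes each fpStatus row sum to 100%,
# giving the Safe/Risky proportions per first-payment status
pct = pd.crosstab(df["fpStatus"],
                  df["target"].map({0: "Safe", 1: "Risky"}),
                  normalize="index") * 100
print(pct.round(1))
```

Here Checked splits 50/50 while Rejected is about two-thirds risky, the same shape of comparison the summary table draws on the real data.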
hasCF¶
plot_stacked_bar(clean_df, "hasCF", maxtickval = 34)
- Summary Statistics:
| target | Risky | Safe | ||
|---|---|---|---|---|
| Counts | Proportion (%) | Counts | Proportion (%) | |
| hasCF | ||||
| True | 15913 | 49.248 | 16399 | 50.752 |
All the loans in the matched data have hasCF = True, so the feature is constant and provides no discriminatory power for risk assessment: with only one value, it cannot separate risky loans (49.25%) from safe loans (50.75%). On top of that, the data dictionary offers no explanation of what hasCF actually means.
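Constant columns like this can be caught programmatically before modelling. A minimal sketch, using a hypothetical toy frame in place of the real data:

```python
import pandas as pd

# Toy frame: hasCF is constant, leadCost is not
df = pd.DataFrame({"hasCF": [True, True, True],
                   "leadCost": [3.0, 25.0, 0.0]})

# A column with a single unique value carries no information for any model;
# dropna=False also flags columns that are constant apart from NaNs
constant_cols = [c for c in df.columns if df[c].nunique(dropna=False) <= 1]
print(constant_cols)
```

Any columns returned this way are safe to drop before feature selection.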
Numerical features by target¶
- According to the provided `clarity_underwriting_dictionary.xlsx` or `clarity_underwriting_dictionary.csv`:
  - clear-fraud-stabilities
cfinq.oneminuteago(Correlation ratio = -0.0636)- Number of unique inquiries for the consumer seen by Clarity in the last 1 minute
cfinq.tenminutesago(Correlation ratio = -0.0101)- Number of unique inquiries for the consumer seen by Clarity in the last 10 minutes
cfinq.onehourago(Correlation ratio = 0.0116)- Number of unique inquiries for the consumer seen by Clarity in the last 1 hour
cfinq.twentyfourhoursago(Correlation ratio = 0.0404)- Number of unique inquiries for the consumer seen by Clarity in the last 24 hours
cfinq.sevendaysago(Correlation ratio = 0.0806)- Number of unique inquiries for the consumer seen by Clarity in the last 7 days
cfinq.fifteendaysago(Correlation ratio = 0.0985)- Number of unique inquiries for the consumer seen by Clarity in the last 15 days
cfinq.thirtydaysago(Correlation ratio = 0.1163)- Number of unique inquiries for the consumer seen by Clarity in the last 30 days
cfinq.ninetydaysago(Correlation ratio = 0.1277)- Number of unique inquiries for the consumer seen by Clarity in the last 90 days
cfinq.threesixtyfivedaysago(Correlation ratio = 0.1081)- Number of unique inquiries for the consumer seen by Clarity in the last 365 days
- clear-fraud-stabilities
nPaidOff(Correlation ratio = -0.1257)- How many MoneyLion loans this client has paid off in the past
originallyScheduledPaymentAmount(Correlation ratio = 0.0038)- Originally scheduled repayment amount (if a customer pays off all of his scheduled payments, this is the amount we should receive)
loanAmount(Correlation ratio = 0.0919)
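The correlation ratios quoted in this list come from the dython associations output used earlier in the notebook. For reference, a minimal sketch of the standard (unsigned) correlation ratio eta, the between-group share of variance for a categorical target against a numeric feature; note eta itself lies in [0, 1], so the signed values above presumably come from a signed variant, and this sketch is illustrative only.

```python
import numpy as np

def correlation_ratio(categories, values):
    """Eta: sqrt of between-group variance over total variance (0..1)."""
    categories = np.asarray(categories)
    values = np.asarray(values, dtype=float)
    overall_mean = values.mean()
    # Between-group sum of squares: group sizes times squared mean offsets
    ss_between = sum(
        len(values[categories == g]) * (values[categories == g].mean() - overall_mean) ** 2
        for g in np.unique(categories)
    )
    ss_total = ((values - overall_mean) ** 2).sum()
    return np.sqrt(ss_between / ss_total) if ss_total > 0 else 0.0

# Perfectly separated groups -> eta = 1
print(round(correlation_ratio([0, 0, 1, 1], [1.0, 1.0, 5.0, 5.0]), 3))
```

A value near 0 means the numeric feature's distribution is the same across target classes; a value near 1 means the classes are fully separated by the feature.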
# Filter out rows with NaN in target or feature columns
sub_df = clean_df[["target",
"cfinq.oneminuteago", "cfinq.tenminutesago", "cfinq.onehourago",
"cfinq.twentyfourhoursago", "cfinq.sevendaysago", "cfinq.fifteendaysago",
"cfinq.thirtydaysago", "cfinq.ninetydaysago", "cfinq.threesixtyfivedaysago",
"nPaidOff", "originallyScheduledPaymentAmount", "loanAmount"]].dropna()
# Extract numerical features dynamically while preserving order
numerical_feat = sub_df.columns[1:].tolist() # Exclude "target"
fig, ax = plt.subplots(nrows = 4, ncols = 3, figsize = (30, 30))
for idx, feat in enumerate(numerical_feat):
lst0 = sub_df[sub_df["target"] == 0][feat].tolist()
lst1 = sub_df[sub_df["target"] == 1][feat].tolist()
cols = [lst0, lst1]
# Compute the subplot indices
row_idx = idx // 3
col_idx = idx % 3
# Create the box plot with mean markers
box = ax[row_idx, col_idx].boxplot(cols, notch = True, patch_artist = True, showmeans = True,
meanprops = {"marker": "s", "markerfacecolor": "white", "markeredgecolor": "Cyan"})
ax[row_idx, col_idx].yaxis.set_major_locator(MaxNLocator(integer = True))
ax[row_idx, col_idx].set_xticklabels(["Safe", "Risky"], size = 15)
ax[row_idx, col_idx].set_xlabel("Loans", size = 15)
ax[row_idx, col_idx].set_ylabel(feat, size = 15)
colors = ["#99FF99", "#FF9999"]
for patch, color in zip(box["boxes"], colors):
patch.set_facecolor(color)
# Add legend for median and mean markers
ax[row_idx, col_idx].legend([box["medians"][0], box["means"][0]], ["Median", "Mean"], loc = "upper right")
# Dynamically remove empty subplots
num_plots = len(numerical_feat)
num_rows, num_cols = ax.shape
for idx in range(num_plots, num_rows * num_cols):
fig.delaxes(ax.flatten()[idx]) # Remove unused axes
plt.show();
# Apply the same order used in numerical_feat (from the plots)
ordered_feat = ["target",
"cfinq.oneminuteago", "cfinq.tenminutesago", "cfinq.onehourago",
"cfinq.twentyfourhoursago", "cfinq.sevendaysago", "cfinq.fifteendaysago",
"cfinq.thirtydaysago", "cfinq.ninetydaysago", "cfinq.threesixtyfivedaysago",
"nPaidOff", "originallyScheduledPaymentAmount", "loanAmount"]
df = clean_df.copy()
df["target"] = clean_df["target"].replace([0, 1], [2, 2]) # For the `Total` row
df = pd.concat([df[ordered_feat], clean_df[ordered_feat]], ignore_index = True).groupby("target").describe(include = "all").sort_index()
# Function for formatting the summary statistics table
def sumstatsfmt(df):
df.rename(index = {0: "Safe", 1: "Risky", 2: "Total"},
columns = {"count": "n", "mean": "Mean", "std": "SD", "min": "Min",
"25%": "Q1", "50%": "Median", "75%": "Q3", "max": "Max"},
inplace = True)
fmts = {"n": "{:,.0f}", "Mean": "{:,.3f}", "SD": "{:,.3f}", "Min": "{:,.0f}", "Q1": "{:,.3f}",
"Median": "{:,.3f}", "Q3": "{:,.3f}", "Max": "{:,.0f}"}
for col, fmt in fmts.items():
df[col] = df[col].map(lambda x: fmt.format(x))
return df
# Summary statistics table
display(Markdown(f'**- Summary Statistics:**'))
df = df.unstack().unstack(1).sort_index(level=[0, 1]).rename_axis(index = ("Numerical features", "Loans"),
axis = 1)
df = df[["count", "mean", "std", "min", "25%", "50%", "75%", "max"]]
# Reorder columns explicitly
df = df.reindex(ordered_feat, level = 0)
with pd.option_context("display.max_rows", 70):
display(sumstatsfmt(df))
del sub_df, numerical_feat, fig, ax, idx, feat, lst0, lst1, cols, row_idx, col_idx, ordered_feat, df;
- Summary Statistics:
| n | Mean | SD | Min | Q1 | Median | Q3 | Max | ||
|---|---|---|---|---|---|---|---|---|---|
| Numerical features | Loans | ||||||||
| cfinq.oneminuteago | Safe | 16,399 | 2.392 | 1.411 | 0 | 1.000 | 3.000 | 3.000 | 12 |
| Risky | 15,912 | 2.242 | 1.450 | 0 | 1.000 | 2.000 | 3.000 | 14 | |
| Total | 32,311 | 2.318 | 1.432 | 0 | 1.000 | 3.000 | 3.000 | 14 | |
| cfinq.tenminutesago | Safe | 16,399 | 3.268 | 2.035 | 0 | 2.000 | 3.000 | 4.000 | 23 |
| Risky | 15,912 | 3.303 | 2.233 | 0 | 2.000 | 3.000 | 4.000 | 35 | |
| Total | 32,311 | 3.285 | 2.135 | 0 | 2.000 | 3.000 | 4.000 | 35 | |
| cfinq.onehourago | Safe | 16,399 | 3.939 | 2.606 | 0 | 3.000 | 3.000 | 5.000 | 33 |
| Risky | 15,912 | 4.080 | 2.862 | 0 | 3.000 | 3.000 | 5.000 | 35 | |
| Total | 32,311 | 4.008 | 2.736 | 0 | 3.000 | 3.000 | 5.000 | 35 | |
| cfinq.twentyfourhoursago | Safe | 16,399 | 4.485 | 3.221 | 0 | 3.000 | 3.000 | 5.000 | 48 |
| Risky | 15,912 | 4.776 | 3.522 | 0 | 3.000 | 4.000 | 6.000 | 60 | |
| Total | 32,311 | 4.628 | 3.376 | 0 | 3.000 | 3.000 | 5.000 | 60 | |
| cfinq.sevendaysago | Safe | 16,399 | 5.155 | 3.969 | 0 | 3.000 | 4.000 | 6.000 | 55 |
| Risky | 15,912 | 5.777 | 4.442 | 0 | 3.000 | 4.000 | 7.000 | 64 | |
| Total | 32,311 | 5.461 | 4.220 | 0 | 3.000 | 4.000 | 6.000 | 64 | |
| cfinq.fifteendaysago | Safe | 16,399 | 5.773 | 4.717 | 0 | 3.000 | 4.000 | 7.000 | 72 |
| Risky | 15,912 | 6.645 | 5.382 | 0 | 3.000 | 5.000 | 8.000 | 76 | |
| Total | 32,311 | 6.202 | 5.074 | 0 | 3.000 | 5.000 | 7.000 | 76 | |
| cfinq.thirtydaysago | Safe | 16,399 | 6.742 | 5.972 | 0 | 3.000 | 5.000 | 8.000 | 89 |
| Risky | 15,912 | 8.016 | 6.890 | 0 | 4.000 | 6.000 | 10.000 | 81 | |
| Total | 32,311 | 7.370 | 6.472 | 0 | 3.000 | 5.000 | 9.000 | 89 | |
| cfinq.ninetydaysago | Safe | 16,399 | 9.489 | 9.680 | 0 | 4.000 | 6.000 | 11.000 | 143 |
| Risky | 15,912 | 11.787 | 11.421 | 0 | 5.000 | 8.000 | 15.000 | 143 | |
| Total | 32,311 | 10.621 | 10.635 | 0 | 4.000 | 7.000 | 13.000 | 143 | |
| cfinq.threesixtyfivedaysago | Safe | 16,399 | 18.185 | 21.986 | 0 | 6.000 | 11.000 | 22.000 | 401 |
| Risky | 15,912 | 22.556 | 25.455 | 0 | 7.000 | 14.000 | 28.000 | 438 | |
| Total | 32,311 | 20.338 | 23.858 | 0 | 6.000 | 12.000 | 25.000 | 438 | |
| nPaidOff | Safe | 16,398 | 0.240 | 0.743 | 0 | 0.000 | 0.000 | 0.000 | 20 |
| Risky | 15,912 | 0.100 | 0.387 | 0 | 0.000 | 0.000 | 0.000 | 6 | |
| Total | 32,310 | 0.171 | 0.599 | 0 | 0.000 | 0.000 | 0.000 | 20 | |
| originallyScheduledPaymentAmount | Safe | 16,399 | 1,819.210 | 1,350.319 | 188 | 1,049.810 | 1,429.600 | 2,066.920 | 16,868 |
| Risky | 15,913 | 1,790.096 | 1,270.922 | 335 | 1,091.760 | 1,388.350 | 1,968.500 | 16,800 | |
| Total | 32,312 | 1,804.872 | 1,311.879 | 188 | 1,073.560 | 1,406.345 | 2,025.985 | 16,868 | |
| loanAmount | Safe | 16,399 | 674.425 | 509.336 | 100 | 400.000 | 500.000 | 750.000 | 4,687 |
| Risky | 15,913 | 637.725 | 480.830 | 200 | 375.000 | 500.000 | 700.000 | 4,000 | |
| Total | 32,312 | 656.351 | 495.835 | 100 | 400.000 | 500.000 | 700.000 | 4,687 |
The box plots, when viewed alongside the tables, clearly show that the biggest differences between safe and risky loan holders lie in how many MoneyLion loans they've successfully paid off in the past, and the number of unique inquiries recorded for them by Clarity.
Risky loan holders consistently have more inquiries over every period measured, from the last 10 minutes all the way to the last 365 days (the one-minute window is the lone exception, where safe loans average slightly more). The gap doesn't just exist, it grows with time. For example, in the last 30 days, risky loan holders averaged 8 inquiries compared to 6.7 for safe ones, and over 90 days the difference gets even wider. Risky loan holders also show more unpredictable and extreme behavior, with much higher variability and some outliers making an exceptionally high number of inquiries, like 438 in just 365 days. That's more than one inquiry per day for an entire year.
But inquiry activity is only part of the picture. For both safe and risky loans, most cases show no history of paid-off loans with MoneyLion. That’s why the median and even the upper quartile are both zero for the two groups. Safe loans, though, are about twice as likely to have at least one paid-off loan compared to risky loans. On average, the numbers are higher for safe loans, even though the typical case is still zero. What really stands out is at the top end. In the safe loan group, a few cases show a long history of loans being paid off, with as many as twenty. In the risky loan group, the best record is much smaller, no more than six. This shows that the strongest repayment histories are only found with safe loans and not with risky ones.
Other features, like the loan amount or the size of scheduled payments, show only modest differences between the groups. Safe loans are slightly more likely to have larger scheduled payments and loan amounts, but these differences aren't as pronounced as those shown by repayment history and inquiry patterns.
Repayment Alignment¶
- Based on originallyScheduledPaymentAmount and total successful payments by target
- Repayment Alignment:
  - Equal: total successful paymentAmount = originallyScheduledPaymentAmount
  - Over: total successful paymentAmount > originallyScheduledPaymentAmount
  - Under: total successful paymentAmount < originallyScheduledPaymentAmount
repymt_align_df = clean_df.copy()
repymt_align_df["Repayment Alignment"] = np.where(repymt_align_df["paymentAmount_tot"] == repymt_align_df["originallyScheduledPaymentAmount"], "Equal",
np.where(repymt_align_df["paymentAmount_tot"] > repymt_align_df["originallyScheduledPaymentAmount"],
"Over",
"Under"))
fig = px.parallel_categories(repymt_align_df.assign(Target = repymt_align_df["target"]
                                             .map({0: "Safe", 1: "Risky"}))
.rename(columns = {"loanStatus": "Current Loan Status"}),
dimensions = ["Repayment Alignment", "Current Loan Status", "Target"],
color = repymt_align_df["Repayment Alignment"].astype("category").cat.codes)
fig.update_layout(width = 1300, height = 700, font = dict(size = 13))
fig.show()
display(pd.crosstab(index = repymt_align_df["loanStatus"],
columns = [repymt_align_df["target"].map({0: "Safe", 1: "Risky"}),
repymt_align_df["Repayment Alignment"]],
margins = True,
margins_name = "Total",
rownames = ["Loan Status"],
colnames = ["Target", "Repayment Alignment"]).T)
del repymt_align_df, fig;
| Target | Repayment Alignment | CSR Voided New Loan | Charged Off | Charged Off Paid Off | Credit Return Void | Customer Voided New Loan | External Collection | Internal Collection | New Loan | Paid Off Loan | Pending Paid Off | Returned Item | Settled Bankruptcy | Settlement Paid Off | Withdrawn Application | Total |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Risky | Equal | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 1 |
| Risky | Over | 0 | 0 | 0 | 0 | 0 | 5 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 6 |
| Risky | Under | 0 | 1 | 109 | 0 | 0 | 9330 | 5133 | 0 | 0 | 0 | 1051 | 282 | 0 | 0 | 15906 |
| Safe | Equal | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 2329 | 1 | 0 | 0 | 2 | 0 | 2333 |
| Safe | Over | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 335 | 1 | 0 | 0 | 5 | 0 | 342 |
| Safe | Under | 16 | 0 | 0 | 70 | 47 | 0 | 0 | 6527 | 6422 | 110 | 0 | 0 | 529 | 3 | 13724 |
| Total |  | 16 | 1 | 109 | 70 | 47 | 9335 | 5134 | 6529 | 9086 | 112 | 1051 | 283 | 536 | 3 | 32312 |
Looking at the table and the parallel categories diagram together, the overall story is clear: safe loans usually resolve positively, while risky loans mostly do not.

For safe loans, most records end either as paid-off loans or as new loans, even when the total successful repayment is less than what was originally scheduled. That is a little surprising, since underpayment would normally be associated with more negative outcomes, yet many of these loans still land in positive categories. The diagram highlights this with large flows from Under into Paid Off Loan and New Loan.

For risky loans, the picture is very different: most end in collections, whether external or internal. The table shows thousands of risky loans with underpayments directed into collections, and the diagram makes this stand out with thick streams flowing in that direction.
There are also some details that do not line up neatly. One case shows Equal repayment alignment yet ends in Settled Bankruptcy, which looks inconsistent. Another is that only 348 loans fall into the Over category out of more than 32,000 in total, roughly $1\%$. That is not literally just a handful, but it is still a very small fraction, and more cases of recorded overpayment might be expected.
A reasonable explanation for these odd results is timing. Payments are not always settled instantly, and loan statuses may be updated before all activity has been fully processed. That could make some loans appear underpaid even though later payments arrived, and it could also explain why an account that looks fully paid is still shown as bankrupt.

Overall, the main message is consistent: safe loans are mostly resolved positively and risky loans mostly end in collections. The smaller odd cases are probably not real mistakes but artefacts of payment records and loan statuses being updated at different times.
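One way to probe the timing hypothesis is to check whether payments still in flight (the dataset's `sum_pymtAmt_Pending` aggregate) would close the gap for loans currently classified as Under. A minimal sketch on synthetic rows shaped like `clean_df`'s columns; the values here are illustrative, and the real check would run on `clean_df` itself:

```python
import pandas as pd

# Synthetic rows shaped like clean_df's repayment columns (illustrative values)
df = pd.DataFrame({
    "paymentAmount_tot":                [550.0, 400.0],   # settled successful payments
    "sum_pymtAmt_Pending":              [50.0,   0.0],    # payments still in flight
    "originallyScheduledPaymentAmount": [600.0, 600.0],
})

under = df["paymentAmount_tot"] < df["originallyScheduledPaymentAmount"]
# Loans that look underpaid now but would reach the scheduled amount
# once pending payments settle
closes_gap = under & (
    df["paymentAmount_tot"] + df["sum_pymtAmt_Pending"]
    >= df["originallyScheduledPaymentAmount"]
)
print(closes_gap.tolist())  # → [True, False]
```

A noticeable share of `True` values among Safe/Under loans would support the settlement-lag explanation.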
Save processed data¶
Save the DataFrame as Parquet rather than CSV because it is faster to read and write, more memory-efficient, preserves column dtypes, and is well supported across data science tools.
# Save as a parquet file
clean_df.to_parquet(f'{temp_dir}/clean_df.parquet', engine = "pyarrow")
# Check: Load parquet file
df = pd.read_parquet(f'{temp_dir}/clean_df.parquet', engine = "pyarrow")
df.info(verbose = "all")
<class 'pandas.core.frame.DataFrame'> RangeIndex: 32312 entries, 0 to 32311 Data columns (total 284 columns): # Column Dtype --- ------ ----- 0 cfinq.thirtydaysago Int32 1 cfinq.twentyfourhoursago Int32 2 cfinq.oneminuteago Int32 3 cfinq.onehourago Int32 4 cfinq.ninetydaysago Int32 5 cfinq.sevendaysago Int32 6 cfinq.tenminutesago Int32 7 cfinq.fifteendaysago Int32 8 cfinq.threesixtyfivedaysago Int32 9 cfind.inquiryonfilecurrentaddressconflict boolean 10 cfind.totalnumberoffraudindicators Int32 11 cfind.telephonenumberinconsistentwithaddress boolean 12 cfind.inquiryageyoungerthanssnissuedate boolean 13 cfind.onfileaddresscautious boolean 14 cfind.inquiryaddressnonresidential boolean 15 cfind.onfileaddresshighrisk boolean 16 cfind.ssnreportedmorefrequentlyforanother boolean 17 cfind.currentaddressreportedbytradeopenlt90days boolean 18 cfind.inputssninvalid boolean 19 cfind.inputssnissuedatecannotbeverified boolean 20 cfind.inquiryaddresscautious boolean 21 cfind.morethan3inquiriesinthelast30days boolean 22 cfind.onfileaddressnonresidential boolean 23 cfind.creditestablishedpriortossnissuedate boolean 24 cfind.driverlicenseformatinvalid boolean 25 cfind.inputssnrecordedasdeceased boolean 26 cfind.inquiryaddresshighrisk boolean 27 cfind.inquirycurrentaddressnotonfile boolean 28 cfind.bestonfilessnissuedatecannotbeverified boolean 29 cfind.highprobabilityssnbelongstoanother boolean 30 cfind.maxnumberofssnswithanybankaccount Int32 31 cfind.bestonfilessnrecordedasdeceased boolean 32 cfind.currentaddressreportedbynewtradeonly boolean 33 cfind.creditestablishedbeforeage18 boolean 34 cfind.telephonenumberinconsistentwithstate boolean 35 cfind.driverlicenseinconsistentwithonfile boolean 36 cfind.workphonepreviouslylistedascellphone boolean 37 cfind.workphonepreviouslylistedashomephone boolean 38 cfindvrfy.ssnnamematch category 39 cfindvrfy.nameaddressmatch category 40 cfindvrfy.phonematchtype category 41 cfindvrfy.phonematchresult category 42 cfindvrfy.overallmatchresult 
category 43 cfindvrfy.phonetype category 44 cfindvrfy.ssndobreasoncode category 45 cfindvrfy.ssnnamereasoncode category 46 cfindvrfy.nameaddressreasoncode category 47 cfindvrfy.ssndobmatch category 48 cfindvrfy.overallmatchreasoncode float64 49 clearfraudscore float64 50 underwritingid object 51 loanId object 52 anon_ssn object 53 payFrequency category 54 apr float64 55 applicationDate datetime64[ns] 56 originated boolean 57 originatedDate datetime64[ns] 58 nPaidOff Int32 59 approved boolean 60 isFunded boolean 61 loanStatus category 62 loanAmount float64 63 originallyScheduledPaymentAmount float64 64 state category 65 leadType category 66 leadCost float64 67 fpStatus category 68 clarityFraudId object 69 hasCF boolean 70 principal_tot float64 71 fees_tot float64 72 paymentAmount_tot float64 73 sum_days_btw_pymts float64 74 mean_days_btw_pymts float64 75 med_days_btw_pymts float64 76 std_days_btw_pymts float64 77 cnt_days_btw_pymts Int32 78 min_days_btw_pymts float64 79 max_days_btw_pymts float64 80 sum_fees_Cancelled float64 81 sum_fees_Checked float64 82 sum_fees_Complete float64 83 sum_fees_None float64 84 sum_fees_Pending float64 85 sum_fees_Rejected float64 86 sum_fees_Rejected Awaiting Retry float64 87 sum_fees_Returned float64 88 sum_fees_Skipped float64 89 sum_principal_Cancelled float64 90 sum_principal_Checked float64 91 sum_principal_Complete float64 92 sum_principal_None float64 93 sum_principal_Pending float64 94 sum_principal_Rejected float64 95 sum_principal_Rejected Awaiting Retry float64 96 sum_principal_Returned float64 97 sum_principal_Skipped float64 98 sum_pymtAmt_Cancelled float64 99 sum_pymtAmt_Checked float64 100 sum_pymtAmt_Complete float64 101 sum_pymtAmt_None float64 102 sum_pymtAmt_Pending float64 103 sum_pymtAmt_Rejected float64 104 sum_pymtAmt_Rejected Awaiting Retry float64 105 sum_pymtAmt_Returned float64 106 sum_pymtAmt_Skipped float64 107 mean_fees_Cancelled float64 108 mean_fees_Checked float64 109 mean_fees_Complete float64 110 
mean_fees_None float64 111 mean_fees_Pending float64 112 mean_fees_Rejected float64 113 mean_fees_Rejected Awaiting Retry float64 114 mean_fees_Returned float64 115 mean_fees_Skipped float64 116 mean_principal_Cancelled float64 117 mean_principal_Checked float64 118 mean_principal_Complete float64 119 mean_principal_None float64 120 mean_principal_Pending float64 121 mean_principal_Rejected float64 122 mean_principal_Rejected Awaiting Retry float64 123 mean_principal_Returned float64 124 mean_principal_Skipped float64 125 mean_pymtAmt_Cancelled float64 126 mean_pymtAmt_Checked float64 127 mean_pymtAmt_Complete float64 128 mean_pymtAmt_None float64 129 mean_pymtAmt_Pending float64 130 mean_pymtAmt_Rejected float64 131 mean_pymtAmt_Rejected Awaiting Retry float64 132 mean_pymtAmt_Returned float64 133 mean_pymtAmt_Skipped float64 134 med_fees_Cancelled float64 135 med_fees_Checked float64 136 med_fees_Complete float64 137 med_fees_None float64 138 med_fees_Pending float64 139 med_fees_Rejected float64 140 med_fees_Rejected Awaiting Retry float64 141 med_fees_Returned float64 142 med_fees_Skipped float64 143 med_principal_Cancelled float64 144 med_principal_Checked float64 145 med_principal_Complete float64 146 med_principal_None float64 147 med_principal_Pending float64 148 med_principal_Rejected float64 149 med_principal_Rejected Awaiting Retry float64 150 med_principal_Returned float64 151 med_principal_Skipped float64 152 med_pymtAmt_Cancelled float64 153 med_pymtAmt_Checked float64 154 med_pymtAmt_Complete float64 155 med_pymtAmt_None float64 156 med_pymtAmt_Pending float64 157 med_pymtAmt_Rejected float64 158 med_pymtAmt_Rejected Awaiting Retry float64 159 med_pymtAmt_Returned float64 160 med_pymtAmt_Skipped float64 161 std_fees_Cancelled float64 162 std_fees_Checked float64 163 std_fees_None float64 164 std_fees_Pending float64 165 std_fees_Rejected float64 166 std_fees_Rejected Awaiting Retry float64 167 std_fees_Skipped float64 168 std_principal_Cancelled 
float64 169 std_principal_Checked float64 170 std_principal_None float64 171 std_principal_Pending float64 172 std_principal_Rejected float64 173 std_principal_Rejected Awaiting Retry float64 174 std_principal_Skipped float64 175 std_pymtAmt_Cancelled float64 176 std_pymtAmt_Checked float64 177 std_pymtAmt_None float64 178 std_pymtAmt_Pending float64 179 std_pymtAmt_Rejected float64 180 std_pymtAmt_Rejected Awaiting Retry float64 181 std_pymtAmt_Skipped float64 182 min_fees_Cancelled float64 183 min_fees_Checked float64 184 min_fees_Complete float64 185 min_fees_None float64 186 min_fees_Pending float64 187 min_fees_Rejected float64 188 min_fees_Rejected Awaiting Retry float64 189 min_fees_Returned float64 190 min_fees_Skipped float64 191 min_principal_Cancelled float64 192 min_principal_Checked float64 193 min_principal_Complete float64 194 min_principal_None float64 195 min_principal_Pending float64 196 min_principal_Rejected float64 197 min_principal_Rejected Awaiting Retry float64 198 min_principal_Returned float64 199 min_principal_Skipped float64 200 min_pymtAmt_Cancelled float64 201 min_pymtAmt_Checked float64 202 min_pymtAmt_Complete float64 203 min_pymtAmt_None float64 204 min_pymtAmt_Pending float64 205 min_pymtAmt_Rejected float64 206 min_pymtAmt_Rejected Awaiting Retry float64 207 min_pymtAmt_Returned float64 208 min_pymtAmt_Skipped float64 209 max_fees_Cancelled float64 210 max_fees_Checked float64 211 max_fees_Complete float64 212 max_fees_None float64 213 max_fees_Pending float64 214 max_fees_Rejected float64 215 max_fees_Rejected Awaiting Retry float64 216 max_fees_Returned float64 217 max_fees_Skipped float64 218 max_principal_Cancelled float64 219 max_principal_Checked float64 220 max_principal_Complete float64 221 max_principal_None float64 222 max_principal_Pending float64 223 max_principal_Rejected float64 224 max_principal_Rejected Awaiting Retry float64 225 max_principal_Returned float64 226 max_principal_Skipped float64 227 
max_pymtAmt_Cancelled float64 228 max_pymtAmt_Checked float64 229 max_pymtAmt_Complete float64 230 max_pymtAmt_None float64 231 max_pymtAmt_Pending float64 232 max_pymtAmt_Rejected float64 233 max_pymtAmt_Rejected Awaiting Retry float64 234 max_pymtAmt_Returned float64 235 max_pymtAmt_Skipped float64 236 cnt_custom Int32 237 cnt_non custom Int32 238 cnt_pymtStatus_Cancelled Int32 239 cnt_pymtStatus_Checked Int32 240 cnt_pymtStatus_Complete Int32 241 cnt_pymtStatus_None Int32 242 cnt_pymtStatus_Pending Int32 243 cnt_pymtStatus_Rejected Int32 244 cnt_pymtStatus_Rejected Awaiting Retry Int32 245 cnt_pymtStatus_Returned Int32 246 cnt_pymtStatus_Skipped Int32 247 cnt_pymtRCode_C01 Int32 248 cnt_pymtRCode_C02 Int32 249 cnt_pymtRCode_C03 Int32 250 cnt_pymtRCode_C05 Int32 251 cnt_pymtRCode_C07 Int32 252 cnt_pymtRCode_LPP01 Int32 253 cnt_pymtRCode_MISSED Int32 254 cnt_pymtRCode_R01 Int32 255 cnt_pymtRCode_R02 Int32 256 cnt_pymtRCode_R03 Int32 257 cnt_pymtRCode_R04 Int32 258 cnt_pymtRCode_R06 Int32 259 cnt_pymtRCode_R07 Int32 260 cnt_pymtRCode_R08 Int32 261 cnt_pymtRCode_R09 Int32 262 cnt_pymtRCode_R10 Int32 263 cnt_pymtRCode_R13 Int32 264 cnt_pymtRCode_R15 Int32 265 cnt_pymtRCode_R16 Int32 266 cnt_pymtRCode_R19 Int32 267 cnt_pymtRCode_R20 Int32 268 cnt_pymtRCode_R29 Int32 269 cnt_pymtRCode_R99 Int32 270 cnt_pymtRCode_RAF Int32 271 cnt_pymtRCode_RBW Int32 272 cnt_pymtRCode_RFG Int32 273 cnt_pymtRCode_RIR Int32 274 cnt_pymtRCode_RUP Int32 275 cnt_pymtRCode_RWC Int32 276 cnt_pymtRCode_RXL Int32 277 cnt_pymtRCode_RXS Int32 278 fpymtDate datetime64[ns] 279 fpymtAmt float64 280 fpymtStatus category 281 target Int8 282 yr_mth period[M] 283 mth int64 dtypes: Int32(55), Int8(1), boolean(31), category(16), datetime64[ns](3), float64(172), int64(1), object(4), period[M](1) memory usage: 55.6+ MB
Session Information¶
Log the full session environment including OS, CPU, Python version and loaded modules to support reproducibility and assist in debugging environment-specific issues.
import importlib.metadata
importlib.metadata.version("markupsafe")
'3.0.2'
display(Markdown("<span style = 'font-size: 18px; font-weight: bold;'> Session Information </span>"))
# https://pypi.org/project/session-info/
session_info.show(na = True, os = True, cpu = True, jupyter = True, dependencies = True,
std_lib = True, private = True, write_req_file = False, req_file_name = None, html = None
)
Session Information
C:\Users\grace\AppData\Local\Programs\Python\Python311\Lib\site-packages\session_info\main.py:213: UserWarning:
The '__version__' attribute is deprecated and will be removed in MarkupSafe 3.1. Use feature detection, or `importlib.metadata.version("markupsafe")`, instead.
Click to view session information
----- __main__ NA _functools NA calendar NA collections NA dython 0.7.9 gc NA importlib NA io NA itertools NA matplotlib 3.10.0 multiprocessing NA numpy 1.26.4 os NA pandas 2.2.3 pathlib NA platform 1.0.8 plotly 5.24.1 psutil 6.1.1 seaborn 0.13.2 session_info v1.0.1 subprocess NA sys 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)] urllib NA -----
Click to view modules imported as dependencies
PIL 10.4.0 __future__ NA __mp_main__ NA _abc NA _ast NA _asyncio NA _bisect NA _blake2 NA _bz2 NA _codecs NA _collections NA _collections_abc NA _compat_pickle NA _compression NA _contextvars NA _csparsetools NA _csv 1.0 _ctypes 1.1.0 _cython_3_0_10 NA _cython_3_0_11 NA _cython_3_0_8 NA _cython_3_1_0a0 NA _datetime NA _decimal 1.70 _distutils_hack NA _elementtree NA _frozen_importlib NA _frozen_importlib_external NA _hashlib NA _heapq NA _imp NA _io NA _json NA _locale NA _loss NA _lsprof NA _lzma NA _moduleTNC NA _multibytecodec NA _multiprocessing NA _ni_label NA _opcode NA _operator NA _overlapped NA _pickle NA _plotly_utils NA _pydev_bundle NA _pydev_runfiles NA _pydevd_bundle NA _pydevd_frame_eval NA _pydevd_sys_monitoring NA _queue NA _random NA _sha512 NA _signal NA _sitebuiltins NA _socket NA _sqlite3 2.6.0 _sre NA _ssl NA _stat NA _statistics NA _string NA _strptime NA _struct NA _thread NA _typing NA _uuid NA _warnings NA _weakref NA _weakrefset NA _win32sysloader NA _winapi NA _zoneinfo NA abc NA anyio NA argparse 1.1 array NA arrow 1.3.0 ast NA asttokens NA asyncio NA atexit NA attr 24.3.0 attrs 24.3.0 babel 2.16.0 backports NA base64 NA bdb NA binascii NA bisect NA bz2 NA cProfile NA certifi 2024.12.14 cffi 1.17.1 charset_normalizer 3.4.1 cloudpickle 3.1.1 cmath NA cmd NA code NA codecs NA codeop NA colorama 0.4.6 colorsys NA comm 0.2.2 concurrent NA contextlib NA contextvars NA copy NA copyreg NA csv 1.0 ctypes 1.1.0 cycler 0.12.1 cython_runtime NA dataclasses NA datetime NA dateutil 2.9.0.post0 debugpy 1.8.11 decimal 1.70 decorator 5.1.1 defusedxml 0.7.1 difflib NA dis NA email NA encodings NA enum NA errno NA executing 2.1.0 fastjsonschema NA faulthandler NA filecmp NA fnmatch NA fqdn NA fractions NA functools NA genericpath NA getopt NA getpass NA gettext NA glob NA google NA gzip NA hashlib NA heapq NA hmac NA html NA http NA idna 3.10 inspect NA ipaddress 1.0 ipykernel 6.29.5 ipywidgets 8.1.5 isoduration NA jaraco NA jedi 0.19.2 jinja2 3.1.5 
joblib 1.4.2 json 2.0.9 json5 0.10.0 jsonpointer 3.0.0 jsonschema 4.23.0 jsonschema_specifications NA jupyter_events 0.11.0 jupyter_server 2.15.0 jupyterlab_server 2.27.3 kaleido 0.2.1 keyword NA kiwisolver 1.4.7 linecache NA locale NA logging 0.5.1.2 lzma NA markupsafe 3.0.2 marshal 4 math NA matplotlib_inline 0.1.7 mimetypes NA mmap NA more_itertools 10.3.0 mpl_toolkits NA msvcrt NA nbformat 5.10.4 nt NA ntpath NA nturl2path NA numbers NA numexpr 2.10.2 opcode NA operator NA overrides NA packaging 24.2 parso 0.8.4 patsy 1.0.1 pdb NA pickle NA pkg_resources NA pkgutil NA platformdirs 4.3.6 plistlib NA posixpath NA pprint NA profile NA prometheus_client NA prompt_toolkit 3.0.48 pstats NA pure_eval 0.2.3 pyarrow 18.1.0 pydev_ipython NA pydevconsole NA pydevd 3.2.3 pydevd_file_utils NA pydevd_plugins NA pydevd_tracing NA pydoc NA pydoc_data NA pyexpat NA pygments 2.19.1 pyparsing 3.2.0 pythoncom NA pythonjsonlogger NA pytz 2024.2 pywin32_bootstrap NA pywin32_system32 NA pywintypes NA queue NA quopri NA random NA re 2.2.1 referencing NA reprlib NA requests 2.32.3 rfc3339_validator 0.1.4 rfc3986_validator 0.1.1 rpds NA runpy NA scipy 1.13.1 secrets NA select NA selectors NA send2trash NA shlex NA shutil NA signal NA site NA six 1.17.0 sklearn 1.6.1 sniffio 1.3.1 socket NA socketserver 0.4 sqlite3 2.6.0 ssl NA stack_data 0.6.3 stat NA statistics NA statsmodels 0.14.4 string NA stringprep NA struct NA sysconfig NA tarfile 0.9.0 tempfile NA tenacity NA textwrap NA threading NA threadpoolctl 3.5.0 time NA timeit NA token NA tokenize NA tornado 6.4.2 traceback NA traitlets 5.14.3 types NA typing NA typing_extensions NA unicodedata NA uri_template NA urllib3 2.3.0 uuid NA warnings NA wcwidth 0.2.13 weakref NA webbrowser NA webcolors NA websocket 1.8.0 win32api NA win32com NA win32con NA win32trace NA winerror NA winreg NA wsgiref NA xarray 2025.1.2 xml NA xmlrpc NA yaml 6.0.2 zipfile NA zipimport NA zlib 1.0 zmq 26.2.0 zoneinfo NA
----- IPython 8.31.0 jupyter_client 8.6.3 jupyter_core 5.7.2 jupyterlab 4.3.4 notebook 7.3.2 ----- Python 3.11.2 (tags/v3.11.2:878ead1, Feb 7 2023, 16:38:35) [MSC v.1934 64 bit (AMD64)] Windows-10-10.0.22631-SP0 8 logical CPU cores, Intel64 Family 6 Model 140 Stepping 1, GenuineIntel ----- Session information updated at 2025-09-23 15:42